Work on this project will extend previous work on the context-dependent
nature of temporal cues to the identity of phonetic segments,
and  on  the  role  of  coarse-grained  aspects  of  the  speech  signal  in 
facilitating  segment  recognition.  These  extensions  will  address  the 
following  questions:  Do  adjacent  segments  exhibit  mutual  dependencies 
resulting  in  perceptual  ambiguity  that  can  be  overcome  by  contextual 
information present in coarse-signal characteristics? Can coarse-grained
aspects of the speech signal, lacking sufficient information for
segment  identification,  convey  speaking  rate  independently  of  variation 
in  the  inherent  durations  of  the  underlying  segments?  Do  coarse-grained 
aspects  of  precursive  speech  contribute  contextual  information  that  is 
used  early  in  the  timecourse  of  segment  recognition?  Can  coarse-grained 
aspects  of  the  speech  signal  direct  attention  to  the  location  of 
upcoming  stressed  syllables? 

Work on the project will directly study the nature of coarse-grained
aspects of the signal and their relation to processing the
suprasegmental  temporal  aspects  of  speech.  New  techniques  will  be 
developed  for  creating  coarse-grained  representations  of  speech  that 
eliminate  information  about  segment  identity  but  preserve  prosodically- 
relevant  aspects  of  the  speech  signal.  These  techniques  will  permit 
control  over  degree  of  resolution  in  the  short-time  spectrum  of  speech. 
Perceptual  studies,  involving  direct  judgments  on  stimuli  with  varying 
amounts of spectral resolution, will be performed to determine the
amount of spectral detail that is necessary for perceiving important
temporal  components  of  prosody. 

As  part  of  the  project  a  computer  simulation  will  be  developed 
that  will  test  the  computational  adequacy  of  the  processes  that  are 
hypothesized  to  underlie  human  perception  of  the  temporal  properties  of 
speech.  This  model  will  address  three  related  issues:  the  segmentation 
of  speech  into  syllables,  the  use  of  temporal  relations  between 
syllables  to  generate  expectancies  about  the  temporal  properties  of 
upcoming  syllables,  and  the  contextual  modulation  of  feature  analyzers 
for  processing  temporal  cues  to  segment  identity. 

Abstract

Four  experiments  addressing  the  role  of  attention  in  phonetic  perception  are  reported. 
The  first  experiment  shows  that  the  relative  importance  of  two  cues  to  the  voicing  distinction 
changes  when  subjects  must  perform  an  arithmetic  distractor  task  at  the  same  time  as 
identifying  a  speech  stimulus.  The  voice  onset  time  cue  loses  phonetic  significance  when 
subjects are distracted, while the F0 onset frequency cue does not. The second experiment
shows a similar pattern for two cues to the distinction between the vowels /i/ (as in "beat") and
/ɪ/ (as in "bit"). When distracted, listeners attach less phonetic significance to formant patterns
while there is a net increase in the phonetic significance attached to vowel duration. Together
these experiments indicate that careful attention to speech perception is necessary for strong
acoustic cues (voice-onset time and formant patterns) to achieve their full phonetic impact,
while weaker acoustic cues (F0 onset frequency and vowel duration) achieve their full phonetic
impact  without  close  attention.  Experiment  3  shows  that  this  pattern  is  obtained  when  the 
distractor  task  places  little  demand  on  verbal  short-term  memory.  Experiment  4  provides  a 
large data set for testing formal models of the role of attention in speech perception. Attention
is shown to influence the signal-to-noise ratio in phonetic encoding. This principle is
instantiated in a network model in which the role of attention is to reduce noise in the phonetic
encoding of acoustic cues. Implications of this work for understanding speech perception and
general  theories  of  the  role  of  attention  in  perception  are  discussed. 

A basic goal of research in speech perception is to understand the relation between
characteristics  of  the  acoustic  signal  and  our  phonetic  percepts.  The  range  of  possible  acoustic- 
phonetic  relations  is  fundamentally  shaped  by  the  nature  of  the  human  vocal  tract  and  auditory 
system  as  well  as  the  distribution  of  sounds  within  a  language.  Even  given  these  constraints,  the 
relation is not constant, but may vary as a function of factors such as environmental noise (e.g.,
Wardrip-Fruin, 1985) and hearing ability (e.g., Lindholm, Dorman, Taylor, & Hannley, 1988). In
this paper, we demonstrate that the relation also varies as a function of the amount of attention
that is given to speech perception. Examination of two different phonetic contrasts shows that
the  importance  of  weak  acoustic  cues  increases  relative  to  that  of  strong  acoustic  cues  when 
subjects  are  prevented  from  devoting  full  attention  to  speech  stimuli.  The  results  are 
quantitatively  well  accounted  for  by  a  model  in  which  information  from  different  acoustic  cues  is 
combined independently (Oden & Massaro, 1978). When this model is interpreted in terms of
statistical decision theory, the shift in cue importance can be seen as resulting from increased
noise in encoding the phonetic significance of acoustic cues when listeners cannot pay close
attention  to  them.  This  interpretation  is  instantiated  as  a  stochastic  interactive  activation  model 
(McClelland,  1991)  in  which  the  role  of  attention  is  to  reduce  noise  in  a  pattern  recognition 
network. 
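
As a rough illustration of what independent cue combination implies, the sketch below uses the multiplicative relative-goodness rule commonly associated with Oden and Massaro's (1978) model. The function names and the particular support values (0.7 and 0.3 for low and high F0 onsets, and the VOT support values) are assumptions chosen for illustration; they are not parameter estimates from the experiments reported here.

```python
def prob_b(vot_support_b: float, f0_support_b: float) -> float:
    """P(/b/ response) when two cues are combined independently.

    Each argument is the degree (between 0 and 1) to which one cue
    supports /b/; support for /p/ is taken to be the complement.
    """
    support_b = vot_support_b * f0_support_b
    support_p = (1.0 - vot_support_b) * (1.0 - f0_support_b)
    return support_b / (support_b + support_p)


def f0_effect(vot_support_b: float,
              low_f0_support_b: float = 0.7,    # hypothetical value
              high_f0_support_b: float = 0.3    # hypothetical value
              ) -> float:
    """Difference in P(/b/) between a low and a high F0 onset at one VOT."""
    return (prob_b(vot_support_b, low_f0_support_b)
            - prob_b(vot_support_b, high_f0_support_b))


if __name__ == "__main__":
    # Hypothetical VOT support values: clear /b/, ambiguous, clear /p/.
    for v in (0.95, 0.50, 0.05):
        print(f"VOT support for /b/ = {v:.2f}: F0 effect = {f0_effect(v):.3f}")
```

Under these assumed values the weak cue has its largest effect when the strong cue is ambiguous. If distraction is modelled, purely for illustration, as pulling the VOT support values toward 0.5 (for example 0.80 rather than 0.95), the F0 effect at the VOT endpoints grows.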

Roles  of  Attention  in  Speech  Comprehension 

It  is  commonly  said  in  introductory  lectures  that  speech  perception  is  a  subjectively  easy 
yet computationally difficult task. The computational complexity of speech recognition is hard
to dispute, but the notion that it is subjectively easy conflicts with the readily available intuition
that at least in some circumstances (e.g., noisy environments and unfamiliar accents) recognition
of speech is subjectively demanding. This experience is consistent with our professional
experience listening analytically to speech. When careful attention is not given to this task,
important  aspects  of  the  acoustic-phonetic  pattern  may  escape  notice.  This  suggests  that 
perceiving  the  phonetic  significance  of  some  acoustic  cues  may  require  attention  and  therefore 
that  the  ultimate  phonetic  perception  of  a  complex  of  acoustic  cues  may  depend  on  how  much 
attention  is  given  to  the  stimulus.  The  goal  of  the  present  investigation  is  to  test  the  validity  of 
this  suggestion  by  examining  whether  the  relative  importance  of  acoustic  cues  to  the  identity  of 
phonetic  segments  varies  as  a  function  of  attention  and  to  develop  a  computational  model  of 
the  role  of  attention  in  perceptual  processing. 

The  operation  of  attention  in  the  comprehension  of  spoken  language  has  been  studied 
from  many  perspectives.  Our  discussion  of  its  roles  will  be  organized  around  four  related 
topics:  (1)  aspects  of  language  that  facilitate  the  attentional  selection  of  a  specific  speech  signal, 
(2) the timecourse of attentional selection, (3) capacity and bottleneck explanations of
attention, and (4) attentional effects on basic auditory processes that may precede speech
recognition. The extensive nature of attentional effects that have been demonstrated suggests
that attention may also be operative in the recognition of phonetic segments - the domain of
present interest. However, there appears to be little previous evidence that bears directly on
this  issue. 

The  study  of  the  attentional  selection  of  a  single  speech  signal  from  a  background  of 
competing signals and noise played a fundamental part in the development of modern theories
of attention. This work, inspired by Cherry’s (1953) shadowing technique, has provided a great
deal of evidence about the kinds of distinctiveness that can form a basis for attentional selection.
Distinctiveness at the following levels of language has been found to facilitate selection:
location of source as cued by binaural disparity (Cherry, 1953), amplitude (Egan, Carterette, &
Thwing, 1954), fundamental frequency (Darwin & Bethell-Fox, 1977), and semantic continuity
(Treisman, 1964). These results, summarized clearly in Bregman (1990), suggest that attention
can seize on many low-level aspects of an acoustic signal as well as high-level semantic aspects of
language.

The timecourse of selection provides additional evidence about the operation of
attention in language comprehension. Using shadowing methodology, this issue has been
studied by abruptly stopping the language input and asking the listener to report as much as
possible from the unattended channel (e.g., Bryden, 1971; Glucksberg & Cowen, 1970).
Listeners are able to report accurately a few seconds of material from the unattended channel,
indicating that the input signal - in some form - is stored temporarily before being lost due to
lack of attentive processing. The speech signal is thought to be stored in a temporary auditory
memory or Precategorical Acoustic Store (PAS; Crowder & Morton, 1969). Phonetic recognition
is thought of as a labeling of the information that is held in this memory. Research on this topic
using delayed discrimination tasks (e.g., Crowder, 1982; Pisoni, 1973; 1975) and the suffix effect
(e.g.,  Crowder  &  Morton,  1969)  has  generally  been  concerned  with  the  accessibility  and 
persistence  of  information  in  auditory  form  rather  than  with  the  process  of  phonetic  labeling 
and  its  possible  dependence  on  attentive  processing.  However,  the  results  from  dichotic 
listening  tasks  which  show  that  information  can  be  recalled  from  the  unattended  channel 
suggest  that  attention-dependent  processing  plays  a  role  at  some  processing  step  between 
auditory  memory  and  the  report  of  a  linguistic  label. 

The  above  characterization  can  be  seen  as  implying  that  language  comprehension 
reaches  some  level  of  processing  without  attention,  but  that  there  is  a  critical  point  beyond 
which attention is necessary. This, of course, raises a classic question in attention research: Are
attentional limits well characterized by a bottleneck in processing or by some general capacity
limits? This question has been central to large literatures on the subject of attention (see e.g.,
Kahneman, 1973; Pashler, 1989). Some studies in both the bottleneck and capacity traditions
have  direct  relevance  to  the  question  at  hand. 

Bottleneck  explanations  of  attention  immediately  provoke  the  question  of  whether  the 
limit  is  early  or  late  in  processing  (Broadbent,  1958;  Deutsch  &  Deutsch,  1963;  Duncan,  1980; 
Treisman & Geffen, 1967; and so on). The occurrence of late selection could be taken as evidence
against the idea that attentive processing is necessary at the stage of acoustic-phonetic mapping.
If it were, then an unattended signal would never reach a lexical level of encoding that is
dependent on some segmental pattern recognition. In fact, research in the late-selection
tradition by Shiffrin, Pisoni and Castaneda-Mendez (1974) seems to suggest that attention has
no  effect  on  the  perception  of  phonetic  segments.  They  used  the  simultaneous  vs.  successive 
technique  of  Shiffrin  and  Gardner  (1972)  to  see  whether  recognition  of  consonants  improved 
when  listeners  knew  the  ear  to  which  a  stimulus  would  be  presented.  This  knowledge,  available 
in  the  successive  condition  but  not  in  the  simultaneous  condition,  had  no  effect  on 
performance.  Shiffrin  et  al.  (1974)  interpreted  this  finding  as  indicating  that  attentional 
limitations are post-perceptual and involve control processes associated with short-term
memory, a conclusion that is consistent with other findings using the simultaneous-successive
procedure (Duncan, 1980; Shiffrin & Gardner, 1972; Shiffrin & Grantham, 1974). However, this
finding cannot be taken as definitive for at least a couple of reasons. First, attention may play a
role in phonetic recognition at a level other than the selection of the physical channel (ear) to
which  a  stimulus  is  presented.  Second,  subsequent  research  using  the  simultaneous-successive 
paradigm  (Kleiss  &  Lane,  1986)  has  shown  that  successive  advantages  are  found  when  the 
required  perceptual  discriminations  are  very  fine,  calling  into  question  the  general  conclusion 
from  the  original  work  that  attentional  limits  are  post-perceptual. 

Conceptualizations  of  attention  as  limited  capacity  also  provide  relevant  information 
concerning the role of attention in speech comprehension. Luce, Feustel and Pisoni (1983)
found that recalling synthetic speech placed greater demands on the central capacity associated
with short-term memory than did recalling natural speech. While the complexity of the task
used by Luce et al. (1983) makes it difficult to pin down the exact nature of the increased
demands of the synthetic speech, it only differed from the natural speech in its acoustic quality,
which suggests that the observed effect was in part perceptual. The idea of capacity limitations
has also motivated studies examining whether prosody guides attention to significant locations
in the speech signal so that they receive more extensive processing (Martin, 1972). A variety of
experiments (Buxton, 1983; Cutler, 1976; Metzler, Martin, Mills, Imhoff & Zohar, 1976; Shields,
McHugh & Martin, 1974; cf. Mens & Povel, 1986) have shown that segments in prosodically
predictable locations (mostly stressed syllables) are recognized more rapidly in phoneme-monitoring
tasks. As with the results of Luce et al. (1983), these findings are consistent with the
idea that attention plays a role in phonetic perception, but their complexity makes it difficult to pin
down the exact nature of the observed effects.

A final area of attentional research to consider is the role of attention in basic auditory
detection. It has been known for some time that listeners can focus attention at a pre-specified
frequency, and that they are very poor at detecting near-threshold tones that are more than a
critical band away from the frequency at which they are attempting to detect a tone (Greenberg
& Larkin, 1968; Scharf, Quigley, Aoki, Peachey & Reeves, 1987; Swets, 1963; 1984). The
potential relevance of this phenomenon to the role of attention in phonetic recognition
depends on one’s view of the relation between speech perception and audition. A prominent
view of this relation holds that the processes underlying speech perception are distinct from
those of basic auditory perception (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967;
Liberman, 1982; Liberman & Mattingly, 1985). However, a substantial number of researchers
have begun to argue that the nature of auditory processing does affect phonetic perception
(e.g., Diehl, 1987; Klatt, 1982; Lindblom, 1986). If this view is accepted, then effects of attention
on auditory detection show that attention operates at a level of processing earlier than phonetic
recognition. This complements the bulk of findings reviewed earlier that can be interpreted
conservatively as indicating that attentional resources are used in the post-perceptual processing
of speech. Thus, existing research points to roles for attention in phases of spoken language
processing that occur both earlier and later than the recognition of phonetic segments, without
providing  compelling  evidence  about  whether  it  plays  a  role  at  that  level. 


Relative  Importance  of  Acoustic  Cues  to  Segment  Identity 

Acoustic-phonetic  research  has  been  concerned  in  large  measure  with  determining  what 
acoustic  cues  have  phonetic  significance  perceptually.  It  has  long  been  clear  that  phonetic 
distinctions are not cued by a single acoustic characteristic, but rather that many aspects of the
acoustic signal contribute to people’s perception of speech sounds. In a famous example,
Lisker (1978) listed 16 acoustic characteristics that may contribute to the perception of the
voicing distinction in inter-vocalic stop consonants. Accompanying this sort of enumeration,
there has been considerable research and debate about what acoustic features are most
important in the recognition of certain phonetic distinctions. One difficulty in assessing the
relative importance of acoustic cues stems from methodological difficulties in constructing
stimuli that allow assessment of the phonetic importance of acoustic cues. A major point of
Lisker’s (1978) paper was that some perceptual impact could be found for nearly any acoustic
correlate of a phonetic distinction if all the other correlates of the distinction were neutralized.
However, effects of some of these acoustic dimensions could not be found if other acoustic
dimensions were given more realistic values. For example, Shinn, Blumstein & Jongman (1985)
have argued that context-dependent cues contribute little to perception if context-invariant cues
are present in the experimental stimuli. However, Nittrouer and Studdert-Kennedy (1986)
cogently challenged the naturalism of the Shinn et al. (1985) stimuli, providing further reason to
believe that results obtained in this sort of experiment may be quite situation-specific. Other
debates about the perceptual importance of various acoustic-phonetic relations can be seen as
resulting in good part from the difficulty of preserving the natural interdependencies among
acoustic dimensions while systematically manipulating those dimensions in experimental stimuli.

The present investigation of the effect of attention on the phonetic encoding of acoustic
cues raises additional questions about the generality of experimental results in speech
perception. Most research of this sort involves listening conditions that are near optimal in
terms of the amount of attention that is given to phonetic encoding of acoustic cues. Subjects
are typically asked to do nothing but listen to the speech sounds and must usually attend only
to a specific segment. This situation contrasts considerably with the conditions under which
people often perceive speech, where they may be simultaneously performing another task (such
as driving a car) and where they are almost certainly focusing on the meaning of a
communication rather than the identity of a single pre-specified phonetic segment. This
prompts the concern that some of the effects that have been observed with close listening
conditions are limited to laboratory conditions. However, it also prompts the hope that
studying the change in relative phonetic importance from focused to unfocused attention will
provide information about what cues are naturally more salient in conditions of unstudied
listening.


A more fundamental reason for studying the role of attention in phonetic encoding is
that it is a general, higher-level perceptual process that may play a role in shaping the acoustic-phonetic
patterns of languages. As noted earlier, a number of factors have been found to
influence the relative phonetic importance of acoustic cues. These include noise (e.g., Wardrip-Fruin,
1985), hearing disability (e.g., Lindholm et al., 1988), early development (e.g., Bernstein,
1983) and late development (e.g., Price & Simon, 1984). However, the operation of attention
differs from these other factors in that it is an always-present property of a person. In contrast,
noise is not a human property, nor is it always present. Hearing disabilities and development
are human dimensions but they are not always operative. With regard to these properties,
attention is more on a par with the structure of the vocal tract or the basic auditory system,
which have long been considered to have a fundamental role in shaping the acoustic-phonetic
pattern of language. The present research investigates whether attention level plays a role in
shaping acoustic-phonetic relations by examining whether it differentially affects the phonetic
importance of strong and weak acoustic cues to the identity of phonetic segments. It seems
likely that differences in the inherent strength of cues reflect the cumulative influences on
acoustic-phonetic patterns, including possible consequences of naturally occurring variation in
attention level.


Experiment  1 

This  experiment  examines  whether  the  amount  of  attention  that  is  allocated  to  speech 
perception influences the relative importance of two acoustic cues to the voicing distinction
between the consonants /b/ and /p/. The amount of attention available for speech perception
is manipulated by sometimes having subjects perform a visually-presented arithmetic distractor
task while the speech stimulus is presented. The two cues to consonant voicing are voice-onset
time (VOT) and the onset frequency of the fundamental (F0). VOT is the time between the
release of a consonant and the onset of phonation. Voiced stop consonants like /b/ have short
VOTs (0 to 10 msec) while voiceless sounds like /p/ have long VOTs (50 to 70 msec) (Lisker &
Abramson, 1964). In addition, voiced consonants tend to have a lower onset frequency of F0
than do voiceless consonants.

In comparing the importance of these two cues, Abramson and Lisker (1985) have
argued that VOT is the primary cue to voicing because the onset frequency of F0 has a strong
effect on perceptual judgments only when VOT is ambiguous. Further evidence of the greater
importance of VOT comes from Bernstein's (1983) finding that perceptual judgments by young
children are consistently influenced by VOT but not by F0. Thus, the use of these two cues
allows us to examine the effect of attention on the relative importance of acoustic cues when
one of the cues is strong and the other is weak. Examining the perception of these cues under
low levels of attention allows us to see whether weak cues achieve their modest phonetic
contribution because of the close attention demanded of subjects in the typical speech
experiment or whether strong cues achieve their large impact because of close attention.

Method 


Subjects. Twelve students at Harvard University were tested individually in a single hour-long
session. They received base pay of $4.00 plus a bonus of up to $3.00 depending on their
speed and accuracy in the arithmetic distractor task. All subjects were native speakers of English
who reported having normal hearing.

Stimuli. The stimuli were created with the Klatt (1980) synthesizer and varied along two
dimensions: VOT and onset frequency of F0. A silent VOT interval, following an initial burst,
ranged from 0 to 70 msec in 10 msec steps. The formant transition characteristics were
appropriate for a labial place of articulation, and the steady state formant frequencies were
appropriate for /a/. Two onset frequencies of F0, 100 Hz and 150 Hz, were used. This
frequency was changed in a linear fashion to 125 Hz over the first 50 msec of voicing. All
characteristics of these stimuli, other than F0, were taken from McClaskey, Pisoni and Carrell
(1983).
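
The two stimulus dimensions described above can be written out as a small sketch. It enumerates the 8 x 2 grid of VOT and F0-onset values and computes the linear F0 glide. The constant and function names are illustrative, and holding F0 at 125 Hz after the first 50 msec is an assumption, since the text specifies only the glide itself. Nothing here reproduces the Klatt synthesizer.

```python
VOT_VALUES_MS = list(range(0, 71, 10))   # 0, 10, ..., 70 msec (eight values)
F0_ONSETS_HZ = (100.0, 150.0)            # low and high F0 onset
F0_TARGET_HZ = 125.0                     # value reached at the end of the glide
F0_GLIDE_MS = 50.0                       # duration of the linear glide


def f0_at(t_ms: float, onset_hz: float) -> float:
    """F0 in Hz at t_ms after the onset of voicing.

    Linear change from onset_hz to 125 Hz over the first 50 msec.
    ASSUMPTION: F0 is held at 125 Hz afterwards (not stated in the text).
    """
    if t_ms < 0:
        raise ValueError("time must be measured from voicing onset")
    if t_ms >= F0_GLIDE_MS:
        return F0_TARGET_HZ
    return onset_hz + (F0_TARGET_HZ - onset_hz) * (t_ms / F0_GLIDE_MS)


# Every (VOT, F0 onset) combination used as a stimulus.
STIMULI = [(vot, f0) for f0 in F0_ONSETS_HZ for vot in VOT_VALUES_MS]

if __name__ == "__main__":
    print(len(STIMULI), "stimuli")
    for onset in F0_ONSETS_HZ:
        print(onset, [f0_at(t, onset) for t in (0, 25, 50)])
```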

Design. The experiment included one practice block and ten experimental blocks of 32
trials each. In the practice block, subjects performed only the arithmetic distractor task. Half of
the experimental blocks were conducted in the distractor condition, in which subjects had to
both perform the arithmetic task and recognize a speech sound. The other blocks were
conducted in the no-distractor condition, where subjects only had to recognize the speech
sound. The experimental blocks alternated between the distractor and no-distractor conditions,
with all subjects starting in the distractor condition. Each experimental block included four
presentations of each of the eight VOT values in a random sequence. F0 onset frequency was
manipulated across pairs of distractor and no-distractor blocks. Half of the subjects began with
a low-frequency F0 onset and the other half with a high-frequency F0 onset.
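
A minimal sketch of how such a block schedule might be generated is given below. The function and field names are hypothetical. The text says only that F0 onset was manipulated across pairs of blocks, so the strict alternation of F0 onset from one pair to the next is an assumption.

```python
import random

VOT_VALUES_MS = list(range(0, 71, 10))   # eight VOT values
REPS_PER_BLOCK = 4                       # 8 x 4 = 32 trials per block
N_EXPERIMENTAL_BLOCKS = 10


def build_schedule(first_f0_hz: int, seed=None):
    """Return the ten experimental blocks for one subject."""
    if first_f0_hz not in (100, 150):
        raise ValueError("F0 onset must be 100 or 150 Hz")
    rng = random.Random(seed)
    other_f0_hz = 150 if first_f0_hz == 100 else 100
    blocks = []
    for i in range(N_EXPERIMENTAL_BLOCKS):
        pair = i // 2
        # Blocks alternate, starting in the distractor condition.
        condition = "distractor" if i % 2 == 0 else "no-distractor"
        # ASSUMPTION: F0 onset alternates from one pair of blocks to the next.
        f0 = first_f0_hz if pair % 2 == 0 else other_f0_hz
        vots = VOT_VALUES_MS * REPS_PER_BLOCK
        rng.shuffle(vots)                # random sequence within the block
        blocks.append({"condition": condition,
                       "f0_onset_hz": f0,
                       "vots_ms": vots})
    return blocks


if __name__ == "__main__":
    for block in build_schedule(first_f0_hz=100, seed=1):
        print(block["condition"], block["f0_onset_hz"], len(block["vots_ms"]))
```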

Procedure. Subjects initiated a trial by clicking a mouse button. At the start of the trial,
two fixation lines appeared on the computer screen followed by the visual test stimulus. For the
practice block and for blocks in the distractor condition, the visual stimulus consisted of three
two-digit numbers which were all multiples of ten. Subjects were asked to decide whether the
difference between the first and second numbers was the same as the difference between the
second and third numbers. They were told to respond as quickly and accurately as possible by
clicking an appropriate mouse button. The number of trials requiring affirmative and negative
responses was equal for each block. Immediately after each distractor block, feedback was
given on speed and accuracy of response in the arithmetic task, and points were awarded based
on speed and accuracy. The amount of bonus money that subjects received was based on the
number of points they earned.
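
The arithmetic judgment can be illustrated with a short sketch that builds a block with equal numbers of affirmative and negative trials. Treating "difference" as a signed difference and sampling triples with replacement are both assumptions, and the names are illustrative rather than taken from the original experimental software.

```python
import itertools
import random

TENS = range(10, 100, 10)   # the two-digit multiples of ten: 10, 20, ..., 90


def same_difference(first: int, second: int, third: int) -> bool:
    """True if the first-to-second step equals the second-to-third step.

    ASSUMPTION: differences are signed (second - first vs. third - second).
    """
    return (second - first) == (third - second)


def make_block(n_trials: int = 32, seed=None):
    """Return n_trials number triples, half affirmative and half negative."""
    if n_trials % 2:
        raise ValueError("n_trials must be even to balance yes and no trials")
    rng = random.Random(seed)
    triples = list(itertools.product(TENS, repeat=3))
    yes = [t for t in triples if same_difference(*t)]
    no = [t for t in triples if not same_difference(*t)]
    # ASSUMPTION: triples are sampled with replacement within a block.
    trials = rng.choices(yes, k=n_trials // 2) + rng.choices(no, k=n_trials // 2)
    rng.shuffle(trials)
    return trials


if __name__ == "__main__":
    for triple in make_block(seed=1)[:5]:
        print(triple, "same" if same_difference(*triple) else "different")
```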

In distractor blocks other than the practice block, a speech sound was presented 500 msec
after the appearance of the numbers. The speech sound was presented over headphones at a
comfortable listening level. After the subjects had made a response in the number task, they
were prompted to identify the speech sound by a "b" and a "p" appearing on the computer
screen. The subjects were told to identify the speech sound as accurately as possible and that
speed was not important. Subjects were told that their bonus would depend only on their
performance on the arithmetic task and that they should treat it as primary. At the end of each
distractor block, the experimenter showed the subject his or her number of errors and mean RT
for the block. The experimenter then encouraged the subject to try hard. Feedback was given
for the arithmetic task only.

The next experimental test block was presented in the no-distractor condition. In this
condition, three pairs of zeros appeared on the computer screen as the speech sound was
presented. The duration that the visual stimulus was displayed was derived from the subject's
average response time to the number task in the previous block of the distractor condition.
After the visual stimulus, the subjects were prompted to respond to the auditory stimulus by a
"b" and a "p" appearing on the computer screen. Subjects were again told to try to respond
accurately, although speedy responses were not necessary.

Results 


Distractor Task Performance. Figure 1 shows the mean response times (of correct
responses) and accuracies in the distractor task as a function of the characteristics of the speech
sound presented on a trial. Response times varied significantly as a function of the interaction
of VOT and F0 onset frequency; F(7,77) = 15.6, p < .001. The fastest response times occurred
when VOT and F0 provided congruent cues to phonetic segment identity, t(11) = 5.7, p < .001
for the linear interaction. This occurs for stimuli with short VOTs and low F0, and for stimuli
with long VOTs and high F0. Response times were longest when VOT and F0 provided
incongruent cues to segment identity, i.e., short VOTs paired with high F0 and long VOTs paired
with low F0.


Identification of Speech Stimuli. Figure 2 shows listeners' identifications of the speech
stimuli as a function of VOT, F0 onset frequency, and distractor condition. For the distractor
blocks in this and subsequent experiments, speech identification responses were excluded if the
response in the distractor task was incorrect. As would be expected given previous results,
there were significant main effects of VOT (F(7,77) = 87.0, p < .001) and F0 onset frequency
(F(1,11) = 170.2, p < .001) on judgments of the stimuli. As would also be expected, there was
a significant interaction of VOT and F0 onset frequency; F(7,77) = 29.0, p < .001. F0 onset
frequency had a greater impact at intermediate values of VOT (20 to 50 msec) than near the
endpoints (0, 10, 60 and 70 msec), t(11) = 12.3, p < .001.

A significant interaction was found between distractor condition and the effect of VOT
on judgments of the speech stimuli; F(7,77) = 12.9, p < .001. As can be seen by comparing the
left and right panels of Figure 2, the effect of VOT on identification was stronger in the
no-distractor condition than in the distractor condition. The distractor task did not have a similar
impact on the F0-onset-frequency cue to voicing. In the no-distractor condition, stimuli with an
F0 onset of 100 Hz produced 32 percent more /b/ responses than did those with a 150 Hz F0
onset, while in the distractor condition, the analogous difference was 36 percent. However, this
increase in the importance of F0 onset frequency in the distractor condition was not significant,
F(1,11) = 1.6, p > .20.

A significant interaction was found between the three factors of distractor condition,
VOT and F0 onset frequency; F(7,77) = 2.8, p < .02. Figure 3 makes the form of this interaction
apparent. It shows the difference in the proportion of /b/ identifications between stimuli with F0
onset frequencies of 100 Hz and 150 Hz, broken down by VOT and distractor task. This
difference is an indication of the phonetic importance of fundamental frequency. The bowed
shape of the two lines makes it clear that F0 has its greatest significance at intermediate levels of
VOT. This pattern, however, is more pronounced for the no-distractor condition than for the
distractor condition. In particular, F0 onset frequency had a greater effect at the VOT endpoints
(0 and 70 msec) in the distractor condition than in the no-distractor condition, t(11) = 3.2,
p < .01.


Discussion

The results of the experiment showed that performance on each task was significantly
influenced by the other task. This was somewhat unexpected in the case of the speed of the
arithmetic distractor task depending on the congruence of the two acoustic cues to the speech
sound. This finding was unexpected because the instructions to the subject emphasized that
the arithmetic task was primary and that they should devote their full effort to it. However
unexpected, the finding is consistent with the idea that the speech identification and distractor
tasks involve a shared limited capacity. The finding that the magnitude of the interference
depended on the congruence of the two acoustic cues to voicing indicates that the process of
identifying phonetic segments consumes more capacity when there is conflicting stimulus
information, and that this influences a concurrent task. Unfortunately, not much can be made of
this specific finding because it is not repeated in the next three experiments.

The manner in which the distractor task influenced the phonetic importance of the VOT
and F0-onset-frequency cues to voicing is more central to the present concern. The
effectiveness of VOT as a cue to the voicing distinction of /ba/ vs. /pa/ was reduced in the
distractor condition as compared to the no-distractor condition. One interpretation of this
finding is that the importance of VOT for phonetic perception is reduced when close attention
cannot be given to the stimulus. A less interesting possibility is that the ability to identify the
speech sound was disrupted in general by the distractor task. This possibility is ruled out by the
results obtained for F0 onset frequency. The phonetic importance of F0 was not diminished by
simultaneous performance of the distractor task; in fact, there was a non-significant increase in
its importance. This indicates that the distractor task did not simply produce an overall
decrement in listeners' ability to identify the speech stimulus, but rather that the decrement in
the importance of VOT was specific to that aspect of the stimulus. The phonetic significance of
F0 onset frequency does not appear to depend on the ability to pay close attention to the
stimulus. In addition, F0 onset frequency had a significant effect on judgments of voicing even
when VOT was unambiguous (cf. Abramson & Lisker, 1985). This effect of F0 onset frequency
at the VOT endpoints increased when listeners were prevented from devoting full attention to
the speech stimulus.

The change in the relative importance of VOT and F0 onset frequency provides a first
answer to our question concerning the importance of attention in the encoding of strong and
weak phonetic cues. On the basis of these results, it appears that strong cues achieve their
commanding phonetic importance through careful attention to the stimulus. Weak cues can
achieve their modest contribution even without careful attention. Of course, this interpretation
presumes that there are not characteristics particular to the processing of VOT and F0 onset
frequency that make VOT more dependent than F0 onset frequency on attentive processing.


Experiment  2 

This experiment has two goals: to test the generality of the previous result concerning
the greater dependence of strong cues than weak cues on attentive processing, and to test a
specific hypothesis about how the lack of attention affects phonetic encoding. This is done by
examining two acoustic cues to the distinction between the vowels /i/ (as in "beat") and /I/ (as
in "bit"). The two cues are formant pattern (for /i/ the first formant is lower and the second and
third formants higher than for /I/) and duration (/i/ tends to be longer than /I/). Several
sources of evidence indicate that formant pattern can be considered the stronger or primary
cue, while duration can be considered a weaker, secondary cue. Formant pattern depends on
the shape of the vocal tract, which is the articulatory characteristic most uniquely related to
vowel identity. Vowel duration depends on the amount of time that a vocal tract shape is
maintained, or more likely on the rate at which the vocal tract approaches and moves away from
a target configuration. While vowels do differ on average in their inherent durations (Peterson
& Lehiste, 1960), these durations also depend on a large number of other factors such as overall
speaking rate, prosodic patterns, and identity of neighboring segments (Gordon, 1989; Klatt,
1976; Peterson & Lehiste, 1960). Formant patterns are also subject to a variety of contextual
effects (e.g., coarticulation, reduction and vocal-tract differences between speakers), but these
influences are less drastic than those that operate on inherent vowel duration. Perceptual data
support this analysis of formant pattern as the stronger cue. Vowel duration tends to influence
subjects' perceptual judgments only when formant pattern is ambiguous (Pisoni, 1975). If the
pattern of results in the previous experiment has been correctly interpreted as indicating a
dependence of strong, but not weak, cues on attentive processing, then we would expect that
the phonetic importance of formant pattern ought to decrease relative to that of duration as
attention to the speech stimulus is decreased.

Our  second  goal  was  to  test  the  hypothesis  that  a  low  attention  level  influences  speech 
perception  by  delaying  phonetic  access  to  an  auditory  representation  of  the  speech  stimulus.  If 
this  were  the  case,  we  would  expect  a  greater  reduction  in  the  phonetic  importance  of  formant 
pattern  for  short  duration  vowels  than  for  long  duration  vowels,  because  there  would  be  less 
time  available  to  access  the  auditory  representation.  In  addition  to  the  actual  physical 
differences in duration between the stimuli, it has been argued that the persistence of short-term
auditory memory increases with the duration of a stimulus, with such a process accounting
for  why  short-duration  vowels  are  perceived  in  a  more  categorical  fashion  than  long-duration 
vowels  (Fujisaki  &  Kawashima,  1969;  1970;  Pisoni,  1971;  but  see  Pisoni,  1973;  1975;  Crowder, 
1981).  If  restricted  attention  influences  speech  perception  in  this  way,  then  we  would  expect 
that shifts in the relative phonetic importance of acoustic cues would be determined by their
duration  and  their  relative  persistence  in  auditory  memory,  not  by  differences  in  inherent 
phonetic  strength. 

Method 


Subjects.  Twelve  new  subjects,  drawn  from  the  same  pool  as  the  previous  experiment, 
participated  in  a  single  hour-long  session.  Pay  was  the  same  as  for  the  previous  experiment. 

Stimuli. The speech stimuli were created on the Klatt (1980) synthesizer and were
closely  modeled  after  those  of  Pisoni  (1975).  A  seven-member  series  of  formant  patterns 
combined  with  two  vowel  durations  (300  msec  and  50  msec)  yielded  a  total  of  14  vowel  stimuli. 
The  formant  series  was  constructed  by  varying  the  center  frequencies  of  the  first  three  formants 
in  equal  logarithmic  steps  from  /i/  to  /I/  (see  Table  1).  The  fourth  and  fifth  formants  were  held 
constant  at  3500  Hz  and  4500  Hz  respectively.  The  bandwidths  of  the  first  three  formant 
frequencies were fixed at 60, 90, and 150 Hz, respectively. The 300 msec and 50 msec vowels
differed  in  their  rise  and  decay  times.  The  rise  and  decay  times  were  50  msec  for  the  300  msec 
vowels  and  10  msec  for  the  50  msec  vowels.  For  the  300  msec  vowels,  fundamental  frequency 
fell  from  125  Hz  at  onset  to  80  Hz  at  offset  while  for  the  50  msec  vowels  fundamental  frequency 
fell  from  125  Hz  to  100  Hz. 
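The construction of the formant series in equal logarithmic steps can be illustrated with a short sketch. This is only an illustration of the spacing rule: the function name `log_steps` is hypothetical, and the endpoint frequencies used in the example are placeholders, not the values given in Table 1.

```python
def log_steps(start_hz, end_hz, n_steps):
    """Return n_steps frequencies from start_hz to end_hz in equal logarithmic steps.

    Equal logarithmic steps mean that each value is the previous one
    multiplied by the same constant ratio.
    """
    ratio = (end_hz / start_hz) ** (1.0 / (n_steps - 1))
    return [start_hz * ratio ** i for i in range(n_steps)]


# Placeholder endpoints for a first formant (assumed, NOT taken from Table 1).
f1_series = log_steps(270.0, 390.0, 7)
for step, hz in enumerate(f1_series, start=1):
    print(f"step {step}: F1 = {hz:.1f} Hz")
```

The same function would be applied separately to each of the first three formants, with the fourth and fifth formants held constant.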

Design and Procedure. The design was the same as the previous experiment with the
following exceptions. There was one practice block and eight experimental blocks. Each block
had 35 trials, five of each formant pattern. Duration of the speech sounds was manipulated
between blocks. The procedure for the previous experiment was modified so that subjects
were told to identify the speech sounds as /i/ as in "beet" or /I/ as in "bit", and were prompted
to respond by an "e" or "i" on the computer screen.

Results 


Distractor Task Performance. Figure 4 shows mean response times and accuracies for the
arithmetic distractor task as a function of the formant pattern and duration of the concurrently
presented vowel. For response times, the main effect of vowel duration was not significant
[F(1,11) = 2.3, p > .15], the main effect of formant pattern was significant [F(6,66) = 5.8,
p < .001], and the interaction of duration and formant pattern was not significant [F(6,66) < 1].
Based on the results of the previous experiment a planned test was performed on the linear
interaction of formant pattern and duration. This test was not significant; t(11) = .4, p > .2.

Identification of Speech Stimuli. Figure 5 shows the mean proportion of /i/ responses as
a function of formant pattern, vowel duration, and distractor condition. There were significant
main effects of formant pattern [F(6,66) = 235.9, p < .001] and vowel duration [F(1,11) = 47.4,
p < .001], as well as a significant interaction of these two factors [F(6,66) = 18.2, p < .001].
Duration had a greater impact at the intermediate formant patterns than at the endpoints of the
continuum, t(11) = 7.3, p < .001.

A  significant  interaction  was  found  between  distractor  condition  and  the  effect  of  formant 
pattern  on  identifications  of  the  speech  sounds;  F(6,66)  =  17.3,  p  <  .001.  Formant  pattern  had 
a  greater  influence  in  the  no-distractor  condition  than  in  the  distractor  condition.  There  was 
also  a  significant  interaction  between  the  effects  of  distractor  manipulation  and  vowel  duration 
on speech identifications; F(1,11) = 13.4, p < .005. However, in contrast to what was observed
for formant pattern, the effect of duration was greater in the distractor than in the no-distractor
condition; 24.1% more /i/ responses occurred for 300 msec vowels than for 50 msec vowels in
the distractor condition, while the difference was only 15.9% in the no-distractor condition.

The three-way interaction of vowel duration, formant pattern, and distractor task was
significant; F(6,66) = 2.57, p < .05. While vowel duration always had its greatest impact when
formant pattern was intermediate, it had an effect on the extreme formant patterns in the
distractor condition but not in the no-distractor condition. A planned contrast showed that the
effect of vowel duration on the endpoint formant patterns was greater in the distractor
condition than in the no-distractor condition, t(11) = 3.6, p < .005.

Discussion 


Performance  on  the  distractor  task  was  not  systematically  related  to  the  acoustic 
characteristics  of  the  concurrently  presented  speech  sound.  While  a  significant  effect  of  formant 
pattern  was  observed,  the  differences  underlying  this  effect  are  not  clearly  related  to  the 
progression  of  formant  patterns.  More  importantly,  there  was  not  a  significant  interaction 
between  formant  pattern  and  vowel  duration.  This  contrasts  with  the  interaction  observed  in 
the previous experiment where the two cues to voicing, VOT and F0 onset frequency, had an
interactive impact on response times in the distractor task; faster times were observed when
the phonetic significance of the cues was congruent than when they were incongruent. This
discrepancy could be due to some difference between the way in which the cues to voicing and
the cues to vowel identity are processed or integrated. More likely, it is due to some
unintended difference in instructional emphasis in the two experiments. As the distractor task
performance is not central to the goals of this project, these possibilities are not pursued
further.

The  results  of  the  speech  perception  task  support  and  extend  the  findings  of  Experiment 
1. The relative importance of two acoustic cues was found to change when listeners could not
devote their full attention to the speech perception task. The strong acoustic cue of formant
pattern  decreased  in  phonetic  importance  when  listeners  were  simultaneously  performing  the 
distractor  task.  This  is  analogous  to  the  effect  observed  for  VOT  in  the  previous  experiment. 

For  the  weak  cue  of  vowel  duration,  there  was  a  significant  increase  in  importance  when 
subjects performed the distractor task. This is a similar but stronger effect than the
non-significant increase observed for the weak cue of F0 onset frequency in the last experiment.
Taken  together,  the  results  of  these  experiments  support  the  following  view  of  the  relation 
between  cue  strength  and  attention.  A  stimulus  must  be  carefully  attended  to  for  a  strong 
acoustic cue to realize its full impact on phonetic categorization. When listeners are prevented
from doing so, the importance of such cues will diminish. In contrast, weak cues depend less
on  attention  in  order  to  achieve  their  phonetic  impact,  and  their  net  contribution  to  listeners' 
identifications  is  not  diminished  and  may  actually  increase  when  attention  is  diverted  by  a 
competing  task. 

The  second  goal  of  this  experiment  was  to  test  the  hypothesis  that  performing  the 
distractor  task  influenced  speech  perception  by  delaying  phonetic  access  to  a  decaying  auditory 
representation of the stimulus. That hypothesis leads to the expectation that the distractor task
would impair encoding of formant information for 50 msec vowels more than for 300 msec
vowels. Examination of Figure 5 shows that this did not occur. In the distractor condition,
formant pattern had at least as big an effect on listeners' identifications for the 50 msec vowels
as it did for the 300 msec vowels. Therefore, this experiment provides no support for the
hypothesis  that  access  to  a  decaying  auditory  representation  is  delayed  by  the  distractor  task. 


Experiment  3 

The  goal  of  this  experiment  is  to  show  that  the  phonetic  encoding  of  acoustic  cues  can  be 
affected  by  distractor  tasks  other  than  the  arithmetic  one  used  in  the  previous  experiments.  In 
addition  to  increasing  methodological  generality,  this  experiment  will  determine  whether  an 
effect  on  speech  perception  can  be  observed  when  the  role  of  verbal  short-term  memory  in  the 
distractor  task  is  reduced.  This  provides  an  initial  step  toward  determining  the  locus  of 
processing  at  which  the  distractor  task  competes  with  speech  perception.  The  experiment 
combines  the  vowel  identification  task  of  the  previous  experiment  with  a  new  line-length 
discrimination  task. 

In  the  previous  arithmetic  distractor  task,  the  stimuli  consisted  of  a  sequence  of  three 
numbers, and subjects had to make a speeded judgment about whether the difference between
the  first  two  was  equal  to  the  difference  between  the  second  two.  At  least  initially,  this  probably 
involved  verbal  encoding  of  the  numbers,  calculation  of  the  two  differences,  and  comparison  in 
short-term  memory.  As  subjects  became  practiced  in  the  task,  it  is  possible  that  some  of  this 
became  automatic  and  that  the  role  of  verbal  short-term  memory  was  reduced.  The  line-length 
discrimination  used  in  the  present  experiment  was  designed  so  as  to  reduce  as  much  as  possible 
the  role  of  verbal  processing  of  the  distractor  stimuli.  This  was  done  by  presenting  subjects 
with  two  vertical  lines  and  asking  them  to  make  a  speeded  judgment  as  to  whether  the  one  on 
the left or the right was longer. As the relevant stimulus characteristics were difficult to encode
verbally  and  the  stimuli  were  present  until  a  response  was  made,  it  seems  likely  that  this  task 
placed  fewer  demands  on  verbal  encoding  or  short-term  memory  than  the  arithmetic  task  did. 

Method 


Subjects.  Twelve  new  subjects  from  the  same  population  as  the  previous  experiment 
served as paid subjects in a single hour-long session.

Stimuli,  Design  and  Procedure.  The  speech  stimuli  were  the  vowel  sounds  used  in 
Experiment 2. The general procedure was the same as in the previous experiment except for
the new distractor task. For this task, a central fixation mark appeared followed by two vertical
lines, one to either side of fixation, and the subject had to make a speeded keypress indicating
the longer of the two lines. The mapping between stimuli and responses was compatible: the
left key indicated the left line and the right key indicated the right line. The absolute length of
the lines was varied across trials, and the difference in length between the short and long lines
was roughly proportional to absolute line length. The centers of the two lines were at the same
heights, but the horizontal position of each line within the hemifield was randomly determined
on each trial. The parameters of the visual stimuli were explored during pilot work to find
values that made the task as difficult as possible while still allowing an attentive subject to
accurately determine which line was longer.

Results 


Distractor  Task  Performance.  Figure  6  shows  the  mean  response  times  and  accuracies  for 
the  line-length  discriminations  as  a  function  of  vowel  duration  and  formant  pattern.  Response 
times were significantly longer when the vowel sound was 300 msec than when it was 50 msec,
F(1,11) = 8.1, p < .02; however, there were significantly fewer errors for the longer vowels than
the shorter vowels [F(1,11) = 16.0, p < .005], suggesting a speed-accuracy tradeoff. Formant
pattern also had a significant effect on response times [F(6,66) = 3.0, p < .02], as did the
interaction between duration and formant pattern [F(6,66) = 2.6, p < .05]. The linear
interaction of formant pattern and vowel duration was not significant; t(11) = .66, p > .2. In
this  experiment,  as  in  the  last  one,  the  effects  on  distractor  task  performance  of  the  speech 
sounds  are  weakly  related  to  the  phonetic  significance  of  the  acoustic  cues. 

Identification  of  Speech  Stimuli.  Figure  7  shows  the  mean  proportion  of  /i/  responses  as 
a  function  of  formant  pattern,  vowel  duration,  and  distractor  condition.  There  were  significant 
main effects of formant pattern [F(6,66) = 128.0, p < .001] and vowel duration [F(1,11) = 41.9,
p < .001], as well as a significant interaction of these two factors [F(6,66) = 10.5, p < .001].
Duration had a greater impact at the intermediate formant patterns than at the endpoints of the
continuum, t(11) = 6.7, p < .001.

Identifying the speech sounds while simultaneously performing the distractor task
diminished the phonetic significance of the formant pattern; F(6,66) = 6.56, p < .001. The net
effect of duration increased from 13.5% in the no-distractor condition to 18.0% in the distractor
condition, but this difference was not significant; F(1,11) = 0.74. The three-way interaction of
vowel duration, formant pattern, and distractor task also failed to reach significance; F(6,66) =
1.0. However, a planned contrast showed that the effect of vowel duration on the endpoint
formant patterns was greater in the distractor condition than in the no-distractor condition,
t(11) = 2.32, p < .05.

Discussion 

The  effect  of  the  line-length  discrimination  task  on  the  speech  identifications  was  similar 
to,  but  weaker  than,  the  effect  of  the  arithmetic  task  observed  in  Experiment  2.  Both  tasks 
caused significant decreases in the effectiveness of formant pattern as a cue to vowel identity.
Both  tasks  resulted  in  a  net  increase  in  the  importance  of  duration  as  a  vowel  cue,  though  this 
effect  was  not  significant  with  the  line-length  task.  Both  tasks  caused  significant  increases  in  the 
effect  of  duration  for  the  formant  patterns  at  the  endpoints  of  the  continuum.  The  effects  of  the 
line-length  task  indicate  that  tasks  other  than  speeded  mental  arithmetic  can  influence  speech 
perception,  and  thus  satisfy  the  goal  of  increasing  methodological  generality  across  distractor 
tasks.  These  effects  also  indicate  that  a  heavy  verbal  short-term  memory  component  is  not  a 
necessary  condition  for  observing  that  a  concurrent  task  affects  speech  perception. 

The  smaller  impact  of  the  line-length  task  as  compared  to  the  arithmetic  task  could  result 
from  several  factors.  Perhaps  a  portion  of  the  impact  of  the  arithmetic  task  was  due  to  its 
reliance on short-term memory or some other processes that were called on less in the
line-length task. Alternatively, the line-length task may have had a smaller impact because it was
easier than the arithmetic task. The mean response time and accuracy for the line-length task
were 877 msec and 90% while they were 1434 msec and 94% for the arithmetic task in the
previous experiment. The greater ease of the line-length task may have caused it to draw less
on general processing resources and therefore to have had less impact on concurrent speech
perception.


Experiment 4

The  goal  of  this  experiment  is  to  provide  data  that  can  be  used  to  test  formal  models  of 
the effect of attentiveness on the phonetic encoding of acoustic cues. The initial modeling effort
will  be  based  on  a  class  of  models  in  which  information  from  different  sources  is  combined 
independently  in  making  perceptual  judgments.  This  idea  is  embodied  in  both  signal  detection 
theory  and  Luce's  choice  theory,  which  have  very  similar  structures  and  which  often  make 
quantitatively  similar  predictions  (see  McClelland,  1991  for  a  cogent  and  relevant  review). 

These  models  have  enjoyed  successful  application  in  far-flung  domains  such  as  visual  letter 
recognition  (Oden,  1979),  judgments  of  social  traits  (Anderson,  1974),  and  semantic  judgments 
(Oden, 1977). Most relevant to the present endeavor, the Fuzzy Logical Model of Perception,
which  incorporates  Luce’s  choice  rule,  has  provided  excellent  accounts  of  many  results  in 
speech  perception  including  the  phonetic  integration  of  distinct  acoustic  information  and  the 
role of context in speech perception (e.g., Oden & Massaro, 1978; Massaro, 1989). The present
modeling  of  speech  perception,  though  based  on  these  successes,  will  employ  the  alternative 
signal  detection  formalism  because  it  is  quite  naturally  implemented  as  a  stochastic  interactive 
activation  model  (McClelland,  1991)  which  we  will  show  provides  a  good  account  of  the  role  of 
attention  in  recognizing  phonetic  segments. 

A  key  feature  of  these  independent  cue  models  is  that  each  perceptually  significant  feature 
of  a  stimulus  is  encoded  independently  of  the  other  features  of  the  stimulus  that  are 
simultaneously  present.  Therefore,  when  fitting  the  model  to  data  the  number  of  parameters  in 
the  model  equals  the  sum  of  the  number  of  feature  values  for  the  different  features,  while  the 
number  of  observations  in  a  factorial  design  equals  the  product  of  the  number  of  feature  values 
for  the  different  features.  Our  previous  experiments  have  therefore  not  allowed  a  meaningful 
test  of  the  model  because  of  the  low  ratio  of  observations  to  parameters.  The  present 
experiment  tackles  this  issue  by  using  the  vowel  stimuli  of  Experiments  2  and  3  but  increasing 
the  number  of  duration  levels  from  two  to  five,  resulting  in  35  speech  sounds.  The  experiment 
uses  the  arithmetic  distractor  task  of  Experiments  1  and  2  because  of  its  large  impact  on 
phonetic  encoding. 
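The parameter-counting argument above can be made concrete with a small sketch. It assumes one model parameter per feature level and one observed response proportion per stimulus within a distractor condition; the function name `count_parameters_and_cells` is hypothetical.

```python
from math import prod


def count_parameters_and_cells(levels_per_feature):
    """Compare model size with design size for an independent cue model.

    Parameters: one cue value per level of each feature (a sum).
    Cells: one observed proportion per stimulus in a factorial design (a product).
    """
    n_parameters = sum(levels_per_feature)
    n_cells = prod(levels_per_feature)
    return n_parameters, n_cells


# Experiments 2 and 3: 7 formant patterns x 2 durations.
print(count_parameters_and_cells([7, 2]))
# Experiment 4: 7 formant patterns x 5 durations.
print(count_parameters_and_cells([7, 5]))
```

With two duration levels there are 9 parameters for 14 cells; with five duration levels there are 12 parameters for 35 cells, which is the improvement in the ratio of observations to parameters that motivates the present design.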

Method 


Subjects.  Thirty-five  individuals  served  as  paid  subjects  in  a  single  experimental  session. 

Stimuli, Design and Procedure. The vowel stimuli were the same as in Experiments 2 and
3 except that three additional vowel durations of 90, 120, and 190 msec were added to the
previously used durations of 50 and 300 msec. Combined factorially with the seven formant
patterns, this yielded a total of 35 vowel stimuli. A session consisted of one block of the
distractor  task  alone  for  practice  followed  by  8  experimental  blocks.  These  alternated  between 
the  no-distractor  and  distractor  conditions.  Each  block  had  35  trials  involving  one  presentation 
of  each  vowel  stimulus  in  a  random  order.  This  means  that  the  vowel  stimuli  at  the  different 
durations  were  mixed  rather  than  blocked  as  in  the  previous  experiments.  The  procedure  was 
otherwise  the  same  as  in  Experiments  2  and  3. 

Results 


The  mean  proportions  of  /i/  identifications  for  the  speech  sounds  in  the  no-distractor  and 
distractor conditions are shown in Figure 8. As expected, there were significant main effects of
formant  pattern  [F(6,204)  =  263.4,  p  <  .001]  and  of  vowel  duration  [F(4,136)  =  76.7, 
p  <  .001].  There  was  a  significant  interaction  between  formant  pattern  and  whether  or  not  the 
distractor  task  was  performed,  [F(6,204)  =  20.6,  p  <  .001].  This  was  due  in  good  part  to  a 
decrease  in  the  importance  of  formant  pattern  when  the  distractor  task  was  being  performed  as 
shown by the linear interaction of formant pattern with distractor task, [F(1,34) = 55.2,
p < .001]. There was also a significant interaction between the distractor task and vowel
duration, [F(4,136) = 13.0, p < .001]. A significant interaction between the linear effect of
vowel duration and distractor condition [F(1,34) = 13.1, p < .001] showed that this is due in
good part to increasing importance of vowel duration when the distractor task was being
performed. A planned test showed that the linear effect of vowel duration was greater at the
formant pattern endpoints in the distractor condition than in the no-distractor condition,
[F(1,34) = 24.2, p < .001].

The  above  analyses  show  that  the  pattern  of  effects  observed  in  this  experiment  is  very 
similar to that observed in the previous three experiments.

Independent  Cue  Models 

In  an  independent  cue  model  within  the  signal  detection  framework,  the  perceptual 
significance  of  each  level  of  a  feature  is  expressed  as  a  z-score.  This  score  represents  the 
distance  between  a  decision  criterion  and  the  mean  encoding  value  for  a  feature  in  units  of 
standard  deviations  of  the  encoding  distribution.  When  a  stimulus  contains  more  than  one 
feature,  its  overall  perceptual  value  is  given  by  the  sum  of  the  z-scores  for  the  features  it 
contains. Therefore, fitting an independent cue model to the present data involves finding the
set  of  z-scores,  one  for  each  level  of  formant  pattern  and  vowel  duration,  that  minimizes  the 
squared  deviations  from  the  observed  response  probabilities  when  the  predicted  probabilities 
are  given  by  the  sum  of  the  cue  values  in  a  stimulus,  as  shown  in  the  following: 


$$P(/i/ \mid S_{D_j F_k}) = \mathrm{ZCDF}(Z_{D_j} + Z_{F_k}).$$

Here, the probability of responding /i/ given a stimulus S with duration j ($D_j$) and formant
pattern k ($F_k$) equals the value of the cumulative normal distribution function (ZCDF) for the
sum of the cue values (expressed in z-scores) for duration j ($Z_{D_j}$) and formant pattern k ($Z_{F_k}$).
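
As a minimal sketch of how predicted probabilities and the RMS fit criterion follow from this equation, the Python fragment below sums two cue values and passes the sum through the cumulative normal. The function names are illustrative and are not taken from the original analysis; the two cue values in the usage line are the Table 2 values quoted in the text for the first formant pattern (2.04) and the first vowel duration (-.838).

```python
import math

def zcdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_i(z_duration, z_formant):
    """Predicted probability of an /i/ response when the two cues add independently."""
    return zcdf(z_duration + z_formant)

def rms_error(predicted, observed):
    """Root-mean-square deviation between predicted and observed proportions."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Cue values quoted in the text (first formant pattern, first vowel duration).
print(round(p_i(-0.838, 2.04), 3))
```

Fitting the model would then amount to searching for the set of cue values that minimizes `rms_error` over all stimuli; the search procedure itself is not shown here.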

As  an  initial  step,  separate  fits  were  found  for  the  no-distractor  and  distractor  conditions. 
The  resulting  parameter  values  are  shown  in  Table  2.  Looking  first  at  the  parameters  for  the  no- 
distractor  condition,  it  is  apparent  that  the  range  of  values  for  levels  of  formant  pattern  is 
greater  than  that  for  vowel  duration.  This  reflects  the  larger  perceptual  role  of  formant  pattern 
as compared to vowel duration. The fit of these parameters to the data has a root-mean-square
(RMS)  error  of  .027,  indicating  a  high  correspondence  between  the  predicted  and  observed 
response  probabilities.  By  comparison,  the  parameters  for  the  distractor  condition  show  a 
similar  pattern  but  are  for  the  most  part  reduced  in  absolute  value.  This  reduction  in  cue  values 
in  the  distractor  condition  reflects  a  reduction  in  the  signal-to-noise  ratio  in  encoding  phonetic 
cues  when  attention  level  is  diminished.  Thus,  attention  can  be  characterized  as  affecting  the 
signal-to-noise ratio in encoding phonetic information. When attention is focused on phonetic
encoding,  as  in  the  no-distractor  condition,  higher  signal-to-noise  ratios  are  obtained.  When 
attention  is  not  focused  on  phonetic  encoding,  as  in  the  distractor  condition,  lower  signal-to- 
noise  ratios  are  obtained.  As  in  the  no-distractor  condition,  the  distractor  condition  shows  a 
greater range of values for formant patterns than for vowel durations. However, the difference
is less than for the no-distractor condition, indicating the relative importance of vowel duration
has increased. Another noteworthy aspect of these results is that the fit for the distractor
condition has an RMS error of .051, which is larger than that of the no-distractor condition. This
increased error probably reflects increased variability stemming from the concurrent tasks and
should  not  be  taken  as  evidence  against  the  model. 

The fits shown in Table 2 demonstrate that a model based on independently combining
information  from  the  formant  pattern  and  vowel  duration  cues  provides  a  viable  quantitative 
framework  for  the  present  results.  However,  because  separate  models  were  fit  for  the  no- 
distractor  and  distractor  conditions,  the  results  do  not  provide  a  unified  account  of  the  effect  of 
diminished attention on the phonetic encoding of acoustic cues. Our strategy for formulating
such an account involves exploring how the cue values obtained in the no-distractor condition
must be modified in order to fit the data from the distractor condition. Different ways of
modifying the cue values will be assessed in terms of how well they fit the results from the
distractor condition. Because no modification of the parameters obtained in the no-distractor
condition can provide a better fit than was obtained when the distractor results were fit separately, a
ceiling on the fit is given by the RMS error of .051 that was observed for that model.

The most straightforward way to conceive of the effect of attention within this framework
is that its withdrawal has separate effects on the phonetic encoding of formant pattern and
vowel duration. This can be implemented simply in a model described by the equation:

(1)  $P(/i/ \mid S_{D_j F_k}) = \mathrm{ZCDF}(A_D Z_{D_j} + A_F Z_{F_k} + K).$

Here, the phonetic values of the vowel durations are scaled by one attention parameter $A_D$ and
those of the formant patterns are scaled by a second attention parameter $A_F$. These attention
parameters can increase or diminish the perceptual significance of the cues that they modify.
The resulting modifications in perceptual significance consist of linear changes in the signal-to-
noise ratios for phonetic encoding of the two kinds of acoustic information. The constant K in
the equation must be included in the model because the phonetic values that are modified (i.e.,
those  from  the  first  model  fit  to  the  no-distractor  condition,  top  of  Table  2)  include  an  arbitrary 
constant.  When  the  phonetic  values  for  formant  patterns  and  vowel  durations  are  multiplied  by 
different  attention  factors  the  value  of  the  constant  is  no  longer  arbitrary  and  must  be  included 
in the model.3 When fit to the data from the distractor condition, this model has an RMS
error  of  .069.  The  attention  parameter  is  .709  for  formant  pattern,  while  it  is  .936  for  vowel 
duration.  The  values  of  these  parameters  convey  important  information  about  what  happens  to 
the  encoding  of  the  two  kinds  of  acoustic  cues  when  listeners  are  not  able  to  pay  close  attention 
to the speech stimulus. With regard to formant pattern, the value of .709 indicates that the
distinctive information conveyed by different levels of formant pattern is considerably reduced
under  low  attention  levels.  With  regard  to  vowel  duration,  the  value  of  .936  indicates  that  the 
distinctive  information  conveyed  by  different  levels  of  this  acoustic  cue  is  also  reduced 
somewhat.  The  analysis  of  the  effect  of  attention  on  vowel  duration  at  this  level  is  thus  quite 
different  from  one  that  looked  simply  at  the  observed  response  probabilities.  There,  it  would 
seem  that  the  importance  of  vowel  duration  actually  increased  when  listeners  could  not  pay 
close attention to the speech stimulus. The modeling result shows that this increase in
importance  is  not  due  to  increased  distinctiveness  of  vowel  duration  but  rather  due  to  reduced 
competition  from  formant  pattern  because  of  the  apparently  greater  dependence  of  that  cue  on 
attentive  processing. 

While the analysis using separate parameters yields some important information, it is not
entirely satisfying because it does not offer any clues as to why attentive processing might be
more important for formant pattern than for vowel duration. A model that treats the two kinds
of cues identically could potentially provide a more complete account of the role of perception
in recognizing phonetic segments. The two kinds of cues would have equal prior standing in a
model with only one attention parameter, such as

(2)  $P(/i/ \mid S_{D_j F_k}) = \mathrm{ZCDF}(A Z_{D_j} + A Z_{F_k} + K)$

where A has the same value for vowel duration and formant pattern. Information is lost from
the cue values obtained in the focused attention condition when A is less than 1, and the
response probabilities move toward the constant response bias given by the parameter K. This
model fits the data with an RMS error of .080 and a value for the attention parameter of .689.
This fit is not as good as for the previous model, but it is achieved with one fewer free
parameter. The implication of this finding is that the effect of diminished attention can be
characterized, at least in part, as a general diminution of phonetic distinctiveness independent
of its acoustic source. The reason that this model can fit as well as it does stems from a
somewhat unintuitive feature of the way that sums of z-scores map onto probabilities. The
impact on response probabilities of the constant phonetic value of a vowel duration diminishes
as the absolute magnitude of the phonetic value of the associated formant pattern increases. As
we have noted several times, formant pattern has substantial phonetic distinctiveness leading to
relatively large phonetic cue values. When these phonetic cue values are diminished by the
attention factor, they may be brought into a range in which the phonetic cue value of a vowel
duration has a greater impact on response probabilities even though it is diminished by the
same factor.


To illustrate this point, consider what happens to the predicted response probabilities for
the two stimuli given by combining the first formant pattern with the first two vowel durations.
As shown by Table 2, the cue value for the first formant pattern is 2.04, while it is -.838 and -.374
for the first two vowel durations respectively. Summing these cue values gives z-scores of 1.20
and 1.66 which yield response probabilities of .886 and .952. Therefore, the net effect of the
vowel duration cue on response probabilities is .066. When these cue values are
multiplied by the attention factor of .689, the cue value for the formant pattern becomes 1.41
and those for the two vowel durations become -.58 and -.26. These yield sums for the stimuli of
.83 and 1.15, resulting in response probabilities of .796 and .875. Thus, one effect of reduced
attention in this model is to increase the net effect of vowel duration on the identification of
these two stimuli from .066 to .079.
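
The arithmetic in this illustration can be checked with a short Python sketch. It uses only the three cue values and the attention factor quoted above; the constant K is left out, as it is in the illustration itself, and the function names are ours. Because the code works from unrounded sums, its results may differ from the reported figures in the third decimal place.

```python
import math

def zcdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

Z_FORMANT = 2.04                 # first formant pattern (Table 2, as quoted)
Z_DURATIONS = (-0.838, -0.374)   # first two vowel durations (as quoted)

def duration_effect(attention):
    """Change in P(/i/) between the two durations when every cue is scaled by `attention`."""
    p_first, p_second = (zcdf(attention * (Z_FORMANT + z)) for z in Z_DURATIONS)
    return p_second - p_first

full = duration_effect(1.0)       # cue values as fit to the no-distractor data
reduced = duration_effect(0.689)  # same cue values scaled by the Equation 2 factor
print(f"full attention: {full:.3f}, reduced attention: {reduced:.3f}")
```

The point of the comparison is that `reduced` exceeds `full` even though the duration cue values themselves were shrunk by the same factor as the formant cue value.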

The  above  model  suggests  the  possibility  that  attention  may  influence  vowel  duration  and 
formant  pattern  in  similar  ways,  but  that  the  magnitude  of  their  phonetic  distinctiveness  may  be 
the  basis  of  the  difference  apparent  in  the  response  probabilities.  The  next  model  takes  this 
possibility one step further. In it, the effect of attention on cue strength is not linear, but rather
is  proportional  to  the  absolute  magnitude  of  the  cue  as  shown  below: 

(3)  $P(/i/ \mid S_{D_j F_k}) = \mathrm{ZCDF}(Z_{D_j} - A\,|Z_{D_j}|\,Z_{D_j} + Z_{F_k} - A\,|Z_{F_k}|\,Z_{F_k} + K).$

Here, the cue strength for each feature value is decreased by the product of an attention factor,
the absolute value of the cue strength, and the cue strength. This model fits the data with an
RMS error of .074 and a value of the attention parameter (A) of .170. This improvement over
the previous model indicates that low attention causes an accelerating loss of information as the
significance of a cue increases. This model with a single attention parameter achieves a fit (.074)
that is not far off that (.069) which was achieved in the model that used two attention
parameters to separately fit vowel duration and formant pattern. This latter model (Equation 3)
seems preferable because of its greater parsimony and because it offers the possibility of a
unified account of the effect of attention on both vowel duration and formant pattern.
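
To make the accelerating loss concrete, the following Python sketch applies the attenuation term of Equation 3 to the three cue values quoted earlier and reports the fraction of each cue that survives. The function names are illustrative, and the constant K is omitted because its fitted value is not reported in this section.

```python
def attenuate(z, a=0.170):
    """Equation 3 attenuation: the cue loses a * |z| * z of its strength."""
    return z - a * abs(z) * z

def retained_fraction(z, a=0.170):
    """Share of the original cue value that remains after attenuation (1 - a * |z|)."""
    return attenuate(z, a) / z

# Two vowel-duration cue values and one formant-pattern cue value from Table 2.
for z in (-0.374, -0.838, 2.04):
    print(f"cue {z:+.3f} -> {attenuate(z):+.3f} (retains {retained_fraction(z):.3f})")
```

Under this rule a weak cue keeps most of its value while a strong cue loses roughly a third, which is the sense in which the loss accelerates with cue magnitude.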

The  effect  of  attention  has  been  incorporated  in  the  above  models  by  changing  the 
phonetic  values  of  the  acoustic  cues.  The  phonetic  values  are  given  in  a  scale  (z-scores)  that 
expresses  the  distance  between  a  decision  criterion  and  the  mean  of  the  encoding  distribution 
in  terms  of  the  width  of  the  encoding  distribution  (which  is  assumed  to  be  normal).  The 
diminished phonetic values due to low attention can therefore be interpreted mechanistically as
being due to a lessening of the distance between the mean of the encoding distribution and the
decision criterion, or an increase in the variability of the encoding of phonetic values, or both.
This leads to the intuitively appealing idea that the role of attention in recognizing speech
patterns is to optimize the signal-to-noise ratio in the phonetic encoding of acoustic cues. We
pursue a specific mechanistic implementation of this idea below.

Stochastic  Interactive  Activation 


In a recent paper, Massaro (1989) showed that interactive-activation models (McClelland
& Rumelhart, 1981; McClelland & Elman, 1986; Rumelhart & McClelland, 1982), as they were
originally put forth, were not capable of accounting for the large number of additive cue effects
that have been found in perceptual (and other kinds of) research. He argued that the
interactive processing in such models precluded additive integration of information. However,
McClelland (1991) showed that the structural and dynamical assumptions of the models were
compatible with such effects, but not in conjunction with the decision rule that had originally
been used in the models. The original rule was based on the relative activation levels of the
output units. Interactive-activation models can exhibit additive cue effects if the decision rule is
changed to one of simply selecting the most active output unit. Further, it must be assumed
that there is variability in either the stimulus input or in the transmission of activation between
network units. This noise in the processing network means that there can be variability across
trials in which output node is most highly activated and consequently in identifications of a
stimulus. At least under some circumstances, such a model will exhibit additive cue effects.

These stochastic interactive activation models (McClelland, 1991) constitute an important
advance in connectionist modeling because they show that interactive activation models can
exhibit additive effects and because the assumption of noisy processing in the network is
consistent with assertions of the biological plausibility of network simulations (Rumelhart &
McClelland, 1986). The models implement additive statistical decision models in a fairly
straightforward  way.  The  information  value  of  an  activation  level  is  relative  to  the  (assumedly 
normal  and  constant)  variability  in  the  network.  Thus,  activation  levels  can  easily  be  related  to  z- 
scores.  For  present  purposes,  stochastic  interactive  activation  offers  a  well  specified  mechanism 
that  exhibits  the  kind  of  additive  cue  effects  that  have  been  observed  here.  In  addition,  the 
structural  assumptions  of  interactive-activation  models  may  be  useful  in  accounting  for  the 
particular  way  in  which  attention  appears  to  operate  in  phonetic  encoding. 

Bounded Activations. According to the model shown in Equation 3, attention has a
nonlinear effect on the signal-to-noise ratio of phonetic encoding: the loss of attention causes
accelerating distortion (loss of information) as the cue values increase. One way in which a
processing mechanism might exhibit this increased loss of fidelity at high cue values would be if
there were some limits on the representational capabilities of the processing units. Such limits
exist within interactive-activation models in the form of the bounds on the activation levels that
can be attained by nodes in a network. The assumption of bounded activations is a very
common one and is computationally important for many of the attractive properties of
multistage and recurrent networks (Rumelhart, Hinton, & McClelland, 1986). Thus, the presence of
bounds on activation is independently motivated, and they provide a mechanism that might
produce the nonlinear dependence of phonetic significance on attention level that we have
observed.

The  accelerating  loss  of  information  in  the  distractor  condition  could  occur  because  the 
increased variability of phonetic encoding due to reduced attention would produce a greater
number of extreme activation values that would be clipped by the bounds on activation levels.
Figure 9 illustrates the workings of this process. The top part of the graph shows density
functions for the phonetic encoding of two acoustic cue values, one moderate and the other
large. The variance of these distributions is selected so that the occurrence of a value that
exceeds the upper or lower bounds is very unlikely. Thus the means of the distributions
roughly equal their modes. The bottom part of the graph shows the effect of encoding the
same two acoustic cues with greater variability. Both functions now bump into the upper
bound to some extent, which produces a clipping of the distribution. That would cause the
means  of  the  distributions  to  shift  to  the  left  of  their  modes  (toward  the  neutral  point).  This 
occurs  to  a  greater  degree  for  the  stronger  cue  than  the  weaker  cue,  resulting  in  accelerating 
information  loss  as  the  modal  phonetic  value  of  an  acoustic  cue  increases. 
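
The clipping argument can be illustrated numerically. The Python sketch below computes the exact mean of a normally distributed activation after values beyond the bounds are set to the bound, and compares the shift away from the mode for a moderate and a large cue under low and high variability. The modes (.65 and .85) and standard deviations (.05 and .25) are hypothetical values chosen for illustration; they are not read off Figure 9.

```python
import math

def pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def clipped_mean(mode, sd, lo=0.0, hi=1.0):
    """Mean of a normal(mode, sd) activation after out-of-range values are set to the bound."""
    a, b = (lo - mode) / sd, (hi - mode) / sd
    inside = mode * (cdf(b) - cdf(a)) - sd * (pdf(b) - pdf(a))  # unclipped part
    return lo * cdf(a) + inside + hi * (1.0 - cdf(b))           # plus mass piled at the bounds

for sd in (0.05, 0.25):          # hypothetical low and high encoding variability
    for mode in (0.65, 0.85):    # hypothetical moderate and large cue activations
        shift = mode - clipped_mean(mode, sd)
        print(f"sd={sd:.2f} mode={mode:.2f} shift toward neutral={shift:.4f}")
```

With the small standard deviation neither mean moves appreciably; with the large one both means move toward the neutral point, and the stronger cue moves further.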

An Interactive Model. A stochastic interactive activation model, simulating this process,
was fit to the speech identifications from the distractor condition. The simulation was
essentially analogous to the model given in Equation 3 except that the accelerating loss of
information was caused by bounds on activation levels rather than by algebraic fiat. Figure 10
shows the organization of the network that was used in the simulation. It has a phonetic
property  level  and  a  phonetic  segment  level.  The  property  level  contains  nodes  that  receive 
excitatory  external  input  and  encode  the  phonetic  significance  of  the  two  acoustic  dimensions, 
formant  pattern  and  vowel  duration,  for  both  /i/  and  /I/.  These  property  nodes  have 
reciprocal  excitatory  connections  with  their  parent  segment  nodes.  The  two  segment  nodes  are 
mutually  inhibitory.  The  design  of  this  small  network  was  meant  to  capture  as  simply  as 
possible  the  stimulus  dimensions  and  response  options  present  in  the  task,  while  building  in 
interactive  processing  similar  to  that  of  earlier  interactive  models  (McClelland  &  Rumelhart, 
1981;  McClelland,  1991;  Rumelhart  &  McClelland,  1982).  The  use  of  separate  property  nodes 
for  each  phonetic  segment  is  analogous  to  the  design  of  the  McClelland  and  Elman  (1986) 
application  of  interactive  activation  models  to  speech. 

The  operation  of  the  network,  with  one  key  exception,  is  the  same  as  McClelland’s  (1991) 
stochastic  interactive  activation  model.  All  nodes  start  a  trial  at  their  resting  level  and  external 
activations  are  applied  to  the  property  nodes.  The  activations  of  all  nodes  are  then  updated 
through  a  series  of  cycles  based  on  continuing  external  stimulation  and  the  activations  of  nodes 
to  which  they  have  weighted  connections.  After  a  specified  number  of  cycles,  the  response  for 
the  trial  is  given  by  the  phonetic  segment  node  that  has  the  highest  running  average  activation 
level.  Details  of  the  operation  of  the  network  are  given  in  the  Appendix.  The  innovation  of 
McClelland’s  (1991)  model  was  that  in  addition  to  the  regular  sources  of  activation,  the 
updating of each node on a cycle included a noise term generated from a normal distribution
with  a  mean  of  zero.  The  presence  of  this  noise  causes  variability  across  trials  in  the  output 
node  that  has  the  highest  activation.  The  magnitude  of  the  external  activations  relative  to  the 
noise  determines  the  network’s  performance  as  a  statistical  decision  process.  The  use  of  noise 
in  the  present  model  differed  from  McClelland’s  in  that  it  was  added  to  a  node’s  output  rather 
than  to  its  input.  The  output  of  a  node,  including  the  added  noise,  was  constrained  to  fall 
between zero and the node’s maximum activation rate of one (as it was in McClelland’s model).
If  the  output  exceeded  either  of  those  bounds,  it  was  set  to  the  bound.  This  provided  the 
mechanism  for  clipping  extreme  activations. 

The simulation was fit to the data from the distractor condition in the following way.
The  external  input  to  the  phonetic  property  nodes  was  derived  from  the  cue  values  for  the  no- 
distractor  condition  shown  in  Table  2.  These  values  were  linearly  re-scaled  into  positive 
activation values that were distributed around .5 rather than 0. The input for /I/ property
nodes  was  set  to  1  minus  the  input  to  the  /i/  property  nodes.  The  slope  of  the  scaling  function 
was  a  free  parameter  in  the  model  and  played  a  critical  role  in  allowing  activation  bounds  to 
influence  network  output.  If  the  slope  were  very  small,  then  all  of  the  external  activations 
would be clustered tightly around the midpoint (.5) and far from the lower and upper bounds
on node outputs (0 and 1 respectively). Thus, a small slope would produce little or no clipping.
However,  a  larger  slope  brings  the  activation  levels  closer  to  the  bounds  and  would  allow 
clipping  to  play  a  role.  The  other  free  parameters  in  the  model  were  the  standard  deviation  of 
the  noise  and  a  bias  parameter.  Changes  in  the  standard  deviation  of  the  noise  affect  the  signal- 
to-noise  ratio  of  perceptual  encoding.  The  bias  parameter  is  analogous  to  the  constant  in 
Equation  3  and  was  implemented  by  adjusting  the  relative  resting  levels  of  the  two  phonetic 
segment  nodes.  The  response  of  the  network  on  a  trial  was  based  on  which  phonetic  segment 
node had the highest running average after 20 cycles. The network was run through 1000
trials with a given external input in order to compute response probabilities. The best possible
fit to the observed response proportions in the distractor condition was determined by
embedding the network in a minimization algorithm (O’Neil, 1971) that searched the parameter
space  for  the  optimum  configuration. 
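
The procedure described above can be sketched in Python as follows. This is a much-simplified illustration and not the simulation that was actually run: the update rule (leaky integration toward the net input), the connection weights, and the update rate are placeholder assumptions of ours, since the actual details are given in the Appendix and in McClelland (1991). What the sketch does preserve is the described structure: property nodes for each segment, reciprocal excitation, mutual inhibition between segment nodes, noise added to node outputs, clipping of outputs to the range 0 to 1, a decision based on the running average after 20 cycles, and response probabilities estimated over 1000 trials.

```python
import random

def clip(x, lo=0.0, hi=1.0):
    """Set an output that exceeds an activation bound to that bound."""
    return max(lo, min(hi, x))

def run_trial(ext_i, noise_sd, rng, cycles=20, bias=0.0,
              rate=0.1, w_up=0.5, w_down=0.1, w_inhibit=0.5):
    """Run one trial and return True if the /i/ segment node wins.

    ext_i holds the external inputs (formant, duration) to the /i/ property
    nodes; the /I/ property nodes receive 1 minus those inputs. The rate and
    weight arguments are hypothetical placeholders.
    """
    ext = {"i": list(ext_i), "I": [1.0 - x for x in ext_i]}
    prop = {"i": [0.0, 0.0], "I": [0.0, 0.0]}   # property-node activations
    seg = {"i": bias, "I": -bias}               # resting levels carry the bias
    running = {"i": 0.0, "I": 0.0}              # running average activations
    for cycle in range(1, cycles + 1):
        # Noise is added to each node's output, which is then clipped to [0, 1].
        prop_out = {s: [clip(a + rng.gauss(0.0, noise_sd)) for a in prop[s]]
                    for s in prop}
        seg_out = {s: clip(seg[s] + rng.gauss(0.0, noise_sd)) for s in seg}
        for s, rival in (("i", "I"), ("I", "i")):
            for k in range(2):
                # External input plus top-down excitation from the parent segment.
                net = ext[s][k] + w_down * seg_out[s]
                prop[s][k] += rate * (net - prop[s][k])
            # Bottom-up excitation minus inhibition from the rival segment.
            net = w_up * sum(prop_out[s]) - w_inhibit * seg_out[rival]
            seg[s] += rate * (net - seg[s])
        for s in seg:
            running[s] += (seg[s] - running[s]) / cycle
    return running["i"] > running["I"]

def p_i_response(ext_i, noise_sd, trials=1000, seed=0):
    """Estimate P(/i/) as the proportion of trials on which the /i/ node wins."""
    rng = random.Random(seed)
    wins = sum(run_trial(ext_i, noise_sd, rng) for _ in range(trials))
    return wins / trials

print(p_i_response((0.84, 0.60), noise_sd=0.24))
```

In a full fit, a function like `p_i_response` would be evaluated for every stimulus and wrapped in a minimization routine that adjusts the slope, noise, and bias parameters to reduce the RMS error.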

The simulation fit the observed data with an RMS error of .075. This is nearly identical to
the  fit  of  .074  obtained  through  Equation  3  which  allowed  for  accelerating  information  loss,  and 
is  better  than  the  fit  obtained  by  Equation  2  which  did  not  allow  for  accelerating  information 
loss. The fit yielded a slope for the activation scaling parameter of .17 and a standard deviation
for the noise distribution of .24. As a consequence, the activation values for the phonetic cue
values of the extreme formant patterns were subject to a substantial amount of clipping. For
example, the most extreme cue value (see Table 2) was a z-score of 2. When this is transformed
into an activation level by multiplying by the scaling factor of .17 and adding .5 (the middle of
the activation range), the result is a modal activation of .84. Given that the standard deviation of
the noise distribution is .24, this is about .67 standard deviations away from the upper bound on the
node's output activation. Thus, the mean output activation of nodes driven by this input will be
less than their modal value, given the clipping process produced by the upper bound.
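
A short Python sketch of this calculation follows. It rescales the z-score into an activation, expresses the distance to the upper bound in noise standard deviations, and estimates the share of outputs that would exceed the bound. It assumes that a node's noisy output is simply its modal activation plus normal noise, which ignores the network dynamics, and the function names are ours. The second line of output uses the noise value of .18 reported for the no-distractor fit.

```python
import math

def cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def modal_activation(z, slope=0.17, midpoint=0.5):
    """Linearly rescale a cue z-score into an activation centred on the midpoint."""
    return midpoint + slope * z

def upper_clipping(z, noise_sd, slope=0.17, upper=1.0):
    """Return (distance to the upper bound in SD units, share of outputs above the bound)."""
    distance = (upper - modal_activation(z, slope)) / noise_sd
    return distance, 1.0 - cdf(distance)

for label, sd in (("distractor", 0.24), ("no-distractor", 0.18)):
    distance, share = upper_clipping(2.0, sd)
    print(f"{label}: {distance:.2f} SD from the bound, {share:.1%} of outputs clipped")
```

Under these assumptions roughly a quarter of the outputs for the most extreme cue would be clipped at the distractor noise level, and fewer at the no-distractor noise level.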

In  order  to  show  that  clipping  by  the  bounded  activations  contributed  to  the  observed  fit, 
the  network  was  re-run  with  the  slope  of  the  scaling  function  fixed  at  .02  rather  than  at  the 
optimum value of .17. The use of this smaller slope means that clipping of the output
activations by the activation bounds was very rare. The resulting simulation fit the data with an
RMS  error  of  .08.  This  fit  is  identical  to  that  obtained  in  the  algebraic  model  (Equation  2)  in 
which  information  loss  was  proportional  to  cue  value.  The  difference  between  the  fit  of  this 
model  and  the  previous  one  demonstrates  that  clipping  by  the  activation  bounds  is  the  source 
of  the  accelerating  information  loss  that  occurs  in  the  distractor  condition. 

In  addition  to  showing  that  clipping  enhances  the  fit  to  the  distractor  data,  it  is  important 
to show that clipping does not impair simulation of the data from the no-distractor condition.
To do so, the slope of the scaling function was fixed at the value of .17 obtained from the best
fit  to  the  distractor  data,  and  the  network  was  fit  to  the  no-distractor  data  by  changing  the 
standard  deviation  of  the  noise  distribution.  The  resulting  simulation  fit  the  no-distractor  data 
with  an  RMS  error  of  .028,  not  far  off  the  fit  of  .027  that  was  obtained  when  the  no-distractor 
data  were  fit  independently  by  an  algebraic  model  that  included  no  bounds.  The  standard 
deviation of the noise distribution for the no-distractor condition was .18, which is a quarter
smaller than the .24 obtained for the distractor condition. This smaller amount of noise meant that there was
less  clipping  than  in  the  distractor  simulation,  and  enabled  the  network  to  accurately  model  the 
no-distractor  data. 

The  above  simulations  show  that  stochastic  interactive  activation  models  can  do  an 
excellent  job  of  instantiating  statistical  decision  models  that  additively  combine  different  sources 
of  information  (McClelland,  1991;  cf.  Massaro,  1989).  Further,  they  show  that  bounds  on 
activations  provide  a  viable  mechanism  for  producing  the  pattern  of  accelerating  information 
loss  that  had  been  shown  by  our  algebraic  models  to  provide  a  parsimonious  account  of  the 
effect  of  reduced  attention.  By  providing  an  independently  motivated  basis  with  which  to 
account  for  the  accelerating  loss  of  information,  the  structural  and  dynamic  characteristics  of 
interactive-activation models clearly take us beyond the statistical decision models in our
understanding  of  the  role  of  attention  in  phonetic  perception. 

Mechanisms of Enhancing Signal-to-Noise Ratios. In the algebraic models, attention level
influences the signal-to-noise ratio in encoding phonetic information: reduced attention lowers
the signal-to-noise ratio while increased attention enhances the signal-to-noise ratio. In the
simulations described above, phonetic information was represented as the activation level of
network units relative to the noise in the network. Changes in signal-to-noise ratio due to
attention level were achieved by holding the modal level of activation associated with an
acoustic cue constant and varying the amount of noise in the network. The simulations have not
motivated this choice of how signal-to-noise ratio is changed, nor have they specified the
mechanism by which attention level influences the amount of variability in encoding. These
issues raise interesting questions even though the present data do not lead to definitive
answers.

One way attention might influence noise would be if phonetic encoding on a trial involved
multiple samples of a stimulus with the sampling process having a constant variance,
independent of attention level, but with attention level influencing the extent of the sampling.
As sampling increased, the variance of the mean of the samples would decrease. Because it is
the  variance  of  the  mean  of  the  phonetic  value  for  an  acoustic  cue  that  determines  its 
information  value,  this  mechanism  could  produce  changes  in  signal-to-noise  ratio  without 
attention  directly  affecting  the  level  of  noise  in  the  network. 
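
The statistical relation behind this proposal is that the standard deviation of a mean of n independent samples is the single-sample standard deviation divided by the square root of n. The Python sketch below states that relation and, purely as an illustration, asks how much more sampling would be needed to reduce the effective noise from .24 to .18, the two values obtained in the simulations. Treating those fitted noise values as standard deviations of a sample mean, and treating the samples as independent, are assumptions made here for the sake of the example.

```python
import math

def sd_of_mean(sample_sd, n_samples):
    """Standard deviation of the mean of n independent samples."""
    return sample_sd / math.sqrt(n_samples)

def sampling_ratio(sd_low_attention, sd_high_attention):
    """Factor by which sampling must grow to shrink the SD of the mean from the first value to the second."""
    return (sd_low_attention / sd_high_attention) ** 2

print(round(sampling_ratio(0.24, 0.18), 2))
```

On this reading, a little under twice as much sampling in the no-distractor condition would be enough to produce the difference in noise between the two fits.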

An  interesting  way  in  which  this  might  occur  would  be  if  attention  controlled  the  duration 
of  the  link-up  between  acoustic  cues  and  phonetic  encoding  units.  There  would  be  less 
variability  in  the  mean  output  of  the  phonetic  unit  stimulated  over  a  long  period  of  time  than 
over  a  short  period  of  time.  Experiment  2  was  designed  to  test  something  like  this  possibility  by 
looking for a differential effect of reduced attention in the phonetic encoding of spectral cues to
formant  pattern  in  short  (50  msec)  and  long  (300  msec)  vowels.  The  short  vowels  would 
presumably  offer  less  opportunity  for  sampling  than  the  long  vowels,  which  could  compound 
problems  in  attention-based  temporal  hook-up  between  phonetic  units  and  acoustic  stimuli  or 
their  auditory  representations.  The  results  of  Experiment  2  (as  well  as  Experiments  3  and  4) 
showed that encoding formant information was not differentially difficult in the short-duration
vowels,  thereby  providing  no  support  for  the  hypothesis.  However,  it  is  possible  that  the 
temporal  linkage  is  more  dependent  on  auditory  memory  than  on  the  physical  duration  of  the 
stimulus.  This  possibility  was  discussed  earlier  and  it  was  pointed  out  that  several  researchers 
have  argued  that  the  duration  of  an  auditory  memory  is  positively  related  to  the  duration  of  the 
acoustic stimulus (e.g., Fujisaki & Kawashima, 1969, 1970). However, this analysis was based
on  inferences  drawn  from  a  specific  model  of  the  relation  between  discrimination  and 
categorical  perception.  Subsequent  research  has  called  this  model  into  question  (Repp,  Healy, 
&  Crowder,  1979).  Further,  Pisoni  (1973)  showed  that  auditory  memory  for  short  and  long 
vowels  showed  similar  persistence  as  indicated  by  the  rate  at  which  performance  in  an  AX 
discrimination  task  decreased  with  increasing  delay  within  the  pair  of  stimuli.  Thus,  Experiment 
2  may  not  have  effectively  manipulated  the  persistence  of  information  in  auditory  memory, 
leaving  open  the  possibility  that  the  duration  of  acoustic  stimulation  of  phonetic  units  may  play 
an  important  role  in  determining  the  variability  of  phonetic  encoding. 
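
The sampling argument above can be made concrete with a small sketch. It assumes, purely for illustration, that a phonetic unit averages independent noisy samples of a cue taken at a fixed interval; the noise level and the 10 msec sampling interval are hypothetical values, not parameters estimated in the experiments.

```python
import math

def std_of_mean(sigma: float, n_samples: int) -> float:
    """Standard deviation of the mean of n independent samples, each with std sigma."""
    return sigma / math.sqrt(n_samples)

def samples_available(duration_ms: float, sample_interval_ms: float) -> int:
    """Number of whole samples that fit in a stimulus of the given duration."""
    return max(1, int(duration_ms // sample_interval_ms))

SIGMA = 1.0          # hypothetical per-sample encoding noise
INTERVAL_MS = 10.0   # hypothetical sampling interval

# Short (50 msec) and long (300 msec) vowels, as in Experiment 2.
short_sd = std_of_mean(SIGMA, samples_available(50, INTERVAL_MS))
long_sd = std_of_mean(SIGMA, samples_available(300, INTERVAL_MS))
```

Under these assumptions the long vowel yields six times as many samples, so the variability of the mean output is smaller by a factor of the square root of six. This is the kind of duration advantage that Experiment 2 looked for and did not find.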

A  second  (and  not  exclusive)  way  that  attention  might  influence  signal-to-noise  ratios  in 
perceptual  encoding  is  by  modulating  the  amplification  of  signal  characteristics  at  a  stage  of 
processing  before  additional  noise  is  encountered  in  transmitting  stimulus  information.  Servan- 
Schreiber,  Printz  and  Cohen  (1990)  have  investigated  the  way  in  which  catecholamines  might  be 
related  to  signal  detection  behavior.  These  neuroactive  substances  have  been  found  to  increase 
the  responsiveness  of  individual  neurons  and,  in  studies  involving  drugs  or  pathological 
conditions, to influence observable signal detection performance. Servan-Schreiber et al. (1990)
have pointed out that an increase in a cell’s input-output gain cannot of itself lead to improved
signal-to-noise  performance.  The  increased  gain  will  apply  to  noise  in  the  input  as  well  as  the 
signal.  However,  increased  gain  by  a  unit  can  improve  a  network’s  signal-to-noise  performance 
if  that  gain  occurs  before  additional  noise  is  added  to  processing.  While  Servan-Schreiber  et  al. 
(1990) show that this truth holds for any strictly increasing gain function, they investigate it in
detail  for  logistic  gain  functions,  which  they  treat  as  a  model  of  neural  activity.  These  functions 
are  S-shaped,  and  as  the  gain  parameter  increases  the  function  becomes  sharper,  approaching  a 
square wave. This family of gain functions can be used to model the operation of attention in
phonetic encoding. However, its performance will be nearly identical to the algebraic models
developed here, which were based on an assumption of normally distributed noise (see
McClelland, 1991). Adjusting the gain parameter of a logistic function is directly analogous to
adjusting the attention scaling parameter (A) in Equation 2, which modeled the effect of
attention as a proportional loss (or gain) in the information value of the acoustic cues. Such
models do not produce the accelerating loss of information that occurs in Equation 3 or in the
network model developed above. Of course, the Servan-Schreiber et al. (1990) analysis applies
to any strictly increasing function. In its domain of current application, Equation 3 produces a
strict increase in information for the no-distractor as compared to the distractor condition and
thus could be taken as a gain function underlying attentional amplification of signals. However,
at least as currently developed, models of attention based on moderating the gain of a network
unit’s response do not offer much insight into the exact form of information modulation that
was  observed  in  the  present  studies  and  is  captured  in  Equation  3. 
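
The gain argument can be illustrated with a short sketch. The logistic form below, with a gain multiplier and a bias term, is an assumed parameterization. The signal-to-noise calculation uses a linear gain as a deliberate simplification rather than the logistic analysis of Servan-Schreiber et al. (1990), and all numerical values are hypothetical.

```python
import math

def logistic(x: float, gain: float, bias: float = 0.0) -> float:
    """S-shaped activation; larger gain sharpens it toward a step function."""
    return 1.0 / (1.0 + math.exp(-(gain * x + bias)))

def snr_after_late_noise(signal: float, sigma_in: float,
                         sigma_late: float, gain: float) -> float:
    """Signal-to-noise ratio of gain * (signal + n_in) + n_late.

    n_in (std sigma_in) enters before the gain stage and is amplified with
    the signal; n_late (std sigma_late) is added afterwards. The two noise
    sources are assumed independent, so their variances add.
    """
    noise_sd = math.sqrt((gain * sigma_in) ** 2 + sigma_late ** 2)
    return gain * signal / noise_sd

# Hypothetical values: unit signal, input noise 0.5, later noise 1.0.
low_gain = snr_after_late_noise(1.0, 0.5, 1.0, gain=1.0)
high_gain = snr_after_late_noise(1.0, 0.5, 1.0, gain=4.0)

# With no later noise, gain leaves the ratio unchanged at signal / sigma_in.
no_late_low = snr_after_late_noise(1.0, 0.5, 0.0, gain=1.0)
no_late_high = snr_after_late_noise(1.0, 0.5, 0.0, gain=4.0)
```

In this simplified setting, raising the gain improves the ratio only when noise is added after the gain stage, and the ratio can never exceed that of the input (signal / sigma_in).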

Summary  of  the  Modeling 

The  analyses  above  demonstrate  that  the  role  of  attention  in  phonetic  perception  is  well 
accounted for by models in which attention level affects the signal-to-noise ratio in the phonetic
encoding  of  acoustic  cues.  This  feature  is  shared  by  the  three  statistical  decision  models 
(Equations  1  -  3)  and  the  network  simulation.  These  models  differ  in  how  changes  in  the 
precision  of  phonetic  encoding  due  to  attention  interact  with  specific  acoustic-phonetic 
relations.  In  the  model  expressed  by  Equation  1,  attention  has  differential  effects  on  the 
formant  pattern  and  duration  cues  to  vowel  identity.  In  the  subsequent  models,  attention  has 
equivalent  effects  on  the  underlying  processing  of  these  two  cues,  but  the  inherent  strength  of 
the  acoustic-phonetic  relations  produces  the  attentional  differences  apparent  in  the  response 
probabilities.  The  effect  of  reduced  attention  in  the  first  of  these  models  (Equation  2)  is  to 
cause  a  proportional  loss  of  information  in  the  underlying  cue  representations.  In  the  next 
model (Equation 3), which achieves a better fit, reduced attention causes accelerating loss of
information  in  the  underlying  cue  representations.  The  network  simulation  produces  this 
accelerating  loss  of  information  because  increased  noise  in  network  activity  due  to  reduced 
attention  results  in  the  bounds  on  the  representational  capability  of  network  units  having  a 
significant  contribution  to  the  network’s  performance. 

The  extent  to  which  attention  level  produces  nonlinear  changes  in  perceptual  significance 
is important in assessing the relative merits of the different models. The models incorporating
this  feature  (Equation  3  and  the  network  simulation)  give  better  fits  to  the  data  than  does  the 
model  involving  proportional  change  in  perceptual  significance  (Equation  2).  Further,  one  of 
the  main  virtues  of  implementing  the  statistical  decision  model  as  a  stochastic  interactive 
activation  model  was  to  see  whether  principles  intrinsic  to  such  models  might  account  for  the 
nonlinear  effect  of  attention  level  on  phonetic  significance  that  was  described  but  unmotivated 
by  the  algebraic  model.  The  existence  of  bounds  on  the  maximum  and  minimum  activation 
rates  of  network  units  provided  an  independently  motivated  mechanism  that  can  produce  just 
such  an  effect.  This  demonstration  has  the  further  merit  of  indicating  an  important 
circumstance  in  which  stochastic  interactive  activation  models  will  not  behave  as  additive 
statistical  decision  models.  An  understanding  of  the  ability  of  parallel  network  models  to  exhibit 
both  additive  and  non-additive  effects  is  essential  to  the  evaluation  of  connectionist  models  of 
perception (Massaro, 1989; McClelland, 1991). For these reasons, models incorporating
nonlinear  effects  of  attention  are  very  interesting  and  have  been  explored  in  detail.  However,  it 
is  important  to  note  that  the  improvement  in  fit  to  the  data  of  these  models  compared  to  the 
proportional information loss model is not large. Application of these principles to other data
sets  will  be  necessary  in  evaluating  the  ultimate  value  of  this  account  of  the  effect  of  attention  on 
perceptual  encoding. 


General  Discussion 

The experiments reported here have shown that attention plays a role in the perception of
phonetic segments and that the relative importance of acoustic cues depends on the amount of
attention that is devoted to the speech stimulus. Experiment 1 showed that the strong voicing
cue of voice-onset time decreased in phonetic importance under low attention levels, while the
weak voicing cue of F0 onset frequency maintained its phonetic contribution. Experiment 2
showed that the strong formant pattern cue to the distinction between /i/ and /I/ also
decreased in phonetic importance under low attention levels, while the weak cue of vowel
duration actually increased its net contribution to phonetic perception. Both these experiments
used an arithmetic distractor task that very likely placed demands on short-term verbal memory.
Experiment  3  showed  that  the  general  pattern  of  speech  perception  results  could  be  obtained 
when  the  distractor  task  consisted  of  a  line-length  discrimination  that  placed  less  demand  on 
short-term  memory.  Experiment  4  provided  a  large  data  set  with  which  to  test  quantitative 
models of the role of attention in the phonetic encoding of acoustic cues. Below, we consider
the  implications  of  these  findings  for  understanding  speech  perception.  Then,  various  facets  of 
the  current  modeling  are  reviewed  and  their  implications  for  general  issues  in  theories  of 
attention  are  discussed. 

Implications  of  the  Findings  for  Theories  of  Speech  Perception 

The  clearest  implication  of  the  present  results  is  that  patterns  of  phonetic  cue  importance 
obtained under conditions of focused listening should not be taken as definitive. It seems very
likely  that  listeners  ordinarily  focus  less  attention  on  phonetic  perception  than  they  do  in  the 
laboratory  and  that  under  low  attention  conditions,  there  is  an  increased  contribution  of  weak 
cues  relative  to  strong  cues.  This  finding  forces  the  conclusion  that  speech  perception  is  more 
dependent  on  multiple  cues  than  was  previously  believed,  and  that  it  is  unlikely  that  a  single 
strong  cue  is  generally  dominant  in  recognition. 

This  conclusion  may  shed  some  light  on  the  difficulties  listeners  have  in  understanding 
synthetic speech (Pisoni & Hunnicutt, 1980). Pisoni (1981; discussed in Luce et al., 1983) has
suggested  that  part  of  this  difficulty  stems  from  the  limitations  of  synthesis-by-rule  systems  in 
generating  the  large  variety  of  acoustic  cues  that  are  found  in  natural  speech.  Under  the  model 
outlined  here,  all  relevant  acoustic  cues  contribute  to  phonetic  perception.  A  single  strong  cue 
can  lead  to  high  levels  of  recognition,  but  only  if  careful  attention  is  given  to  the  stimulus.  Thus, 
synthetic  speech  which  successfully  encoded  only  the  strong  acoustic  cues  to  segment  identity 
would place heavy demands on attentional resources in order to be recognized successfully,
an analysis that is consistent with the findings of Luce et al. (1983).

In  a  less  artificial  vein,  these  findings  may  also  have  implications  for  theories  of  sound 
change.  Ohala  and  others  (Javkin,  1979;  Ohala,  1988)  have  articulated  a  model  of  sound  change 
arising from propagation of error in speech communication considered as a transmission line.
According  to  this  view,  differences  may  occur  between  a  speaker’s  phonetic  intention  and  a 
listener’s  phonetic  perception  based  on  various  kinds  of  distortions  in  the  transmission  process. 
When the listener then takes a turn as a speaker, these distorted phonetic impressions may then
be introduced into the process as novel elements of phonetic intentions. Javkin (1979)
identifies three sources of transmission distortions: articulatory errors, acoustic interference,
and biases in auditory perception. He points out that the great majority of phonological analysis
has  focused  on  the  role  of  articulatory  error,  though  he  produces  interesting  evidence  to 
support  a  role  for  auditory  processes  in  phonological  change.  The  present  analysis  of  the  role 
of  attention  in  phonetic  perception  offers  additional  clues  as  to  how  auditory  perception  might 
participate  in  phonological  change. 

The  view  of  phonetic  perception  adopted  here  considers  phonetic  encoding  of  acoustic 
cues  as  a  noisy  process,  and  shows  that  the  amount  of  encoding  noise  increases  under  low 
attention. This process has the interesting consequence of increasing the net contribution to a
phonetic percept of weak cues relative to strong cues. Low attention would therefore be a
factor that promotes equalization of the phonetic values of acoustic cues. Of course, this
process cannot explain how a dominant acoustic cue is created from a weak one, but it does
point to how an originally insignificant acoustic correlate might achieve enough phonetic status
as  a  cue  that  it  might  be  acted  upon  by  other  forces  of  sound  change.  As  analyzed  here,  the 
effect  of  attention  on  phonetic  perception  would  be  neutral  with  regard  to  the  direction  of 
sound  change;  the  addition  of  noise  is  symmetric  and  does  not  push  the  perception  of  a 
phonetic segment in any particular direction. However, noisy phonetic encoding due to low
attention could combine with directional forces to accelerate sound change. These directional
forces  could  be  internal  to  the  perceiver,  as  in  the  cases  of  perceptual  biases  studied  by  Javkin 
(1979).  Or,  they  could  derive  from  effects  of  phonetic  context  on  the  acoustic  structure  of 
phonetic  segments. 

Our analysis of the present results has characterized attention as operating at a phonetic
rather than auditory level of processing. This characterization results from the finding that the
dependence of an acoustic cue on attentive processing relates to its potential phonetic strength.
This provides the most straightforward unification of our results concerning the VOT and F0
onset  frequency  cues  to  stop  consonant  voicing  and  the  formant  pattern  and  duration  cues  to 
vowel  identity.  In  both  cases,  the  differential  effects  of  attention  were  related  to  the  magnitude 
of  the  potential  phonetic  strength  of  the  acoustic  cues.  The  strong  cues,  VOT  and  formant 
pattern,  required  close  attention  to  achieve  their  commanding  phonetic  significance.  The  weak 
cues, F0 onset frequency and vowel duration, did not require careful attention in order to
maintain  their  potential  phonetic  significance.  Further,  several  models  of  the  results  of 
Experiment  4  (Equations  2  and  3  and  the  network  simulation)  show  how  attentional  processing 
at  a  phonetic  level  could  interact  with  phonetic  strength  in  order  to  produce  the  observed 
results.  Given  the  present  findings,  this  phonetic  level  of  explanation  is  much  more  complete 
than  the  alternative  of  accounting  for  the  role  of  attention  in  the  auditory  processing  of  each  of 
the  individual  acoustic  cues.  For  this  reason,  we  prefer  a  phonetic  level  of  explanation  at 
present. 

Still,  it  is  worth  considering  the  possibility  that  properties  specific  to  the  individual 
acoustic  cues  account  for  the  differential  effect  of  attention.  The  best  fit  to  the  distractor  data  of 
Experiment  4  was  obtained  by  the  model  (Equation  1)  that  used  different  attention  parameters 
to  scale  the  information  loss  for  the  formant  pattern  and  duration  cues  to  vowel  identity.  This  is 
not  surprising  since  the  model  included  an  additional  free  parameter  compared  to  the  latter 
models.  Further,  a  difference  in  the  processing  of  the  two  cues  can  not  be  explained  by  such  a 
model  per  se  because  the  model  builds  in  differential  treatment  of  the  cues.  Motivation  for  this 
differential  treatment  must  come  from  outside  the  model.  This  leads  to  the  question  of 
whether  there  are  any  psychoacoustic  factors  that  might  make  formant  pattern  and  VOT  depend 
heavily on close attention, whereas vowel duration and F0 onset frequency do not. On the face
of  it,  there  is  little  to  suggest  that  this  is  the  case.  Of  the  attention-dependent  cues,  formant 
pattern  is  primarily  a  spectral  cue  while  VOT  is  primarily  a  temporal  cue.  Similarly,  for  the  cues 
that are not attention dependent, F0 onset frequency is a spectral cue while vowel duration is a
temporal cue. In making this comparison, we do not mean to imply that formant pattern and F0
onset frequency are processed by one auditory mechanism for spectral processing while vowel
duration and VOT are processed by a separate auditory mechanism for temporal processing.
We only mean to point out that dependence on attentive processing does not relate to the most
obvious dimension of psychoacoustic similarity among the acoustic cues that we studied.
Unless developed in unforeseen directions, this kind of explanation does not compete with
explanations based on potential phonetic strength.

There is, however, one other dimension of analysis that correlates with degree of attention
dependence. In addition to being cues to segment identity, both vowel duration and F0 onset
frequency are prosodic cues, while formant pattern and VOT are not. A great deal of research
indicates that prosodic patterns play an important role in the process of selecting a speech
signal  because  they  provide  some  continuity  of  the  signal  over  time  (Bregman,  1990). 
Accordingly,  a  first  goal  of  the  speech  perception  system  may  be  to  latch  onto  prosodic  cues 
because  they  provide  a  basis  for  continuing  signal  selection  (in  general,  if  not  in  the  laboratory 
perception  of  monosyllables).  If  this  were  the  case,  prosodic  cues  might  be  processed  for  a 
longer  time  than  non-prosodic  cues,  giving  them  an  advantage  for  accurate  encoding  (recall  that 
one way in which the signal-to-noise ratio of perceptual encoding can be improved is by
increasing the size of the sample of the signal). When attention can be focused on speech
perception, this early advantage may be overcome by sustained sampling of the signal which
would allow the full significance of the non-prosodic cues to be encoded. When attention is not
devoted exclusively to speech perception, the non-prosodic cues may not be encoded with the
precision necessary to allow them to overcome their initial disadvantage relative to the prosodic
cues. As noted above, this account is speculative. Its main appeal is that it potentially unifies
two  facets  of  attention:  selection  and  processing  facilitation.  However,  until  it  receives  further 
investigation  we  prefer  the  phonetic  level  of  explanation  advanced  earlier. 

One  characteristic  of  all  of  the  models  described  above  is  that  attention  operates  early  in 
perceptual  processing.  This  characteristic  is  shared  with  other  work  indicating  a  role  of 
attention  in  early  selection  of  perceptual  inputs  (Broadbent,  1958;  Kleiss  &  Lane,  1986;  Pashler, 
1984).  It  is  not  shared  by  theories  that  stress  that  attentional  selection  occurs  late  in  processing 
(Deutsch  &  Deutsch,  1963;  Duncan,  1980;  Shiffrin  &  Gardner,  1972).  The  findings  that  motivate 
the  current  modeling  are  not  consistent  with  the  notion  that  the  extraction  of  perceptual 
features  is  preattentive  (Neisser,  1967).  That  idea  is  embodied  in  current  theories  that  propose 
that featural information is extracted automatically and in parallel, and that the role of attention
in  perception  is  to  integrate  features  into  objects  (Treisman  &  Gelade,  1980).  The  present  work 
indicates  that  the  accuracy  with  which  features  are  encoded  depends  on  the  available 
attentional  resources  and  describes  a  mechanism  that  manifests  this  dependence.  We  believe 
that  this  demonstrates  some  good  reasons  for  paying  attention  to  phonetic  perception. 
Appendix

The operation of the network follows the general principles discussed in Rumelhart et
al. (1986) and followed in McClelland (1991). At the beginning of a trial, all nodes are set to their
resting levels and appropriate external activations are applied to relevant nodes. At each time
step in processing, the input to a node i is computed as follows:

$$\mathrm{net}_i = \sum_j w_{ji} o_j + \mathrm{ext}_i$$

Here, $w_{ji}$ is the weight of the connection from node j to node i, $o_j$ is the output of node j, and
$\mathrm{ext}_i$ is the external input. Given the net input, the activation of a node is updated using the
Rumelhart-McClelland rule:

If $\mathrm{net}_i > 0$ then:

$$\Delta a_i = I (M - a_i)\,\mathrm{net}_i - D (a_i - r)$$

or else:

$$\Delta a_i = I (a_i - m)\,\mathrm{net}_i - D (a_i - r)$$

The constant M is the maximum activation rate, m is the minimum activation rate, r is the
resting activation level, and I and D respectively scale the effects of input and decay. The values
for the various constants were taken from McClelland (1991) who characterized them as generic.
Excitatory connections were set at weights of 1, while inhibitory connections had weights of -1; M
= 1; m = -.2; r = -.1; I = .1 and D = .1. The final response was selected as the output unit with the
highest "running average", using the following formula (McClelland, 1991):

$$\bar{a}_i(t) = \lambda\, o_i(t) + (1 - \lambda)\, \bar{a}_i(t-1)$$

where $\bar{a}$ equals the running average, t equals the time step, o is a unit’s output, and the
parameter $\lambda$ is set to 0.05.
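
The update rule and running average described in this appendix can be sketched in a few lines of Python. This is a minimal, noise-free illustration using the constants listed above, not the simulation code used for the reported fits. The function names are hypothetical, and the stochastic component of the network is omitted because its form is not specified in this appendix.

```python
from typing import Sequence

# Constants as listed in the appendix (taken there from McClelland, 1991).
M, m, r = 1.0, -0.2, -0.1   # maximum, minimum, and resting activation
I, D = 0.1, 0.1             # scaling of input and of decay
LAMBDA = 0.05               # running-average rate

def net_input(weights_into_i: Sequence[float],
              outputs: Sequence[float], ext_i: float) -> float:
    """net_i = sum_j w_ji * o_j + ext_i."""
    return sum(w * o for w, o in zip(weights_into_i, outputs)) + ext_i

def delta_activation(a_i: float, net_i: float) -> float:
    """Change in activation for one time step."""
    if net_i > 0:
        return I * (M - a_i) * net_i - D * (a_i - r)
    return I * (a_i - m) * net_i - D * (a_i - r)

def update_running_average(avg_prev: float, o_i: float) -> float:
    """Exponentially weighted running average of a unit's output."""
    return LAMBDA * o_i + (1 - LAMBDA) * avg_prev

def settle(ext: float, steps: int = 200) -> float:
    """Drive one isolated unit (no incoming connections) with constant input."""
    a = r
    for _ in range(steps):
        a += delta_activation(a, net_input([], [], ext))
    return a
```

For an isolated unit, a constant external input of 1 settles at an activation of 0.45 and an input of -1 settles at -0.15. Both values lie inside the bounds set by M and m, since the input term shrinks as activation approaches either bound.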
Table 1: Vowel Formant Frequencies. These are the center frequencies of the first three
formants of the seven-member continuum from /i/ to /I/. The fourth and fifth formants were
set  at  3500  Hz  and  4500  Hz  respectively.  The  stimuli  were  closely  modeled  after  ones  used  by 
Pisoni  (1975). 